Insanely Fast Whisper: Transcribe 150 Minutes of Audio in 98 Seconds — Free

By Prahlad Menon

OpenAI charges $0.006 per minute to transcribe audio. Google charges $0.024. AWS charges $0.024.

Someone just open-sourced a tool that does it for $0. On your own machine. No API key. No cloud bill. No per-minute meter running.

It’s called Insanely Fast Whisper. 8,800 GitHub stars. MIT License. And the benchmarks are not hype.

The Numbers

Run on an NVIDIA A100 against 150 minutes of audio:

| Configuration | Time |
| --- | --- |
| Standard Whisper Large v3 (fp32) | 31 minutes |
| Large v3 + fp16 + batching + BetterTransformer | 5 minutes |
| Large v3 + fp16 + Flash Attention 2 | 98 seconds |
| Distil Whisper + Flash Attention 2 | 78 seconds |

That’s a 19x speedup over vanilla Whisper. Same model. Same accuracy. Same audio. Just faster.

For context: at OpenAI’s pricing, 150 minutes of audio costs $0.90. At AWS, it’s $3.60. With Insanely Fast Whisper, it costs nothing — just compute you already own.
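The arithmetic is easy to check yourself. A quick sketch using the per-minute rates quoted above (cloud pricing changes, so treat these numbers as the article's snapshot, not current rates):

```python
# Per-minute prices as quoted in this article; verify against current
# pricing pages before relying on them.
PER_MINUTE = {
    "OpenAI Whisper API": 0.006,
    "Google Speech-to-Text": 0.024,
    "AWS Transcribe": 0.024,
}

def transcription_cost(minutes, rate_per_minute):
    """Cloud cost in dollars for a given audio length."""
    return minutes * rate_per_minute

for provider, rate in PER_MINUTE.items():
    print(f"{provider}: ${transcription_cost(150, rate):.2f} for 150 minutes")
```

At any real volume the gap compounds: a podcast back-catalog of 10,000 minutes is $60 at OpenAI's rate and $240 at Google's, versus electricity on hardware you already own.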

What It Does

  • One-command transcription of any local file or URL
  • Speaker diarization — knows who said what, not just what was said
  • Translation — transcribe and translate in one pass
  • Word-level timestamps — precise alignment for subtitles or search
  • Works on NVIDIA GPUs and Apple Silicon (M1/M2/M3)
  • Clean JSON output with timestamps, ready for downstream processing

Getting Started

pipx install insanely-fast-whisper

Basic transcription:

insanely-fast-whisper --file-name audio.mp3

With Flash Attention 2 (maximum speed):

insanely-fast-whisper --file-name audio.mp3 --flash True

On Mac (Apple Silicon):

insanely-fast-whisper --file-name audio.mp3 --device-id mps

With speaker diarization (requires a free HuggingFace token):

insanely-fast-whisper --file-name audio.mp3 --hf-token YOUR_HF_TOKEN

Output goes to output.json by default — structured, timestamped, ready for whatever you’re building on top of it.
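Feeding that file into your own tooling takes a few lines. A minimal Python sketch, assuming the common Whisper-pipeline shape of a `chunks` list with `[start, end]` timestamps — verify against your own `output.json`, since the exact schema is an assumption here, not documented in this article:

```python
def parse_chunks(data):
    """Turn the chunk list into (start_sec, end_sec, text) tuples.

    Assumes each chunk looks like {"timestamp": [start, end], "text": "..."},
    a typical Whisper-pipeline shape; check your own output.json first.
    """
    return [
        (c["timestamp"][0], c["timestamp"][1], c["text"].strip())
        for c in data["chunks"]
    ]

# Hand-written stand-in for a real output file.
# In practice: data = json.load(open("output.json"))
sample = {
    "text": "Hello world.",
    "chunks": [{"timestamp": [0.0, 1.4], "text": " Hello world."}],
}
segments = parse_chunks(sample)
print(segments)  # [(0.0, 1.4, 'Hello world.')]
```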

Who Should Drop Their Subscription

Podcasters paying Descript ($24/month) or Otter.ai ($100/year) to transcribe recordings — if you have a GPU or Apple Silicon, this replaces both.

Journalists and researchers running interviews — diarization means you get speaker-labeled transcripts, not just a wall of text.

Developers building audio pipelines — the JSON output integrates cleanly. No API rate limits, no latency waiting on a cloud round-trip, no cost scaling with volume.

Content creators doing captions — word-level timestamps feed directly into subtitle workflows.

Lawyers and enterprise teams paying thousands for transcription contracts — this runs on-premise, meaning your audio never leaves your infrastructure.
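For the captions case above, word-level timestamps map almost directly onto subtitle formats. A hedged sketch that converts `(start, end, text)` segments — the tuple shape is my assumption, not the tool's documented output — into SRT blocks:

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """segments: iterable of (start_sec, end_sec, text) tuples (assumed shape)."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)

print(to_srt([(0.0, 1.4, "Hello world."), (1.4, 3.0, "Second line.")]))
```

The output pastes straight into a `.srt` file that any video player or caption editor accepts.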

The Honest Caveats

Flash Attention 2 requires a compatible NVIDIA GPU (Ampere or newer — A100, RTX 3090, 4090, etc.). The A100 benchmarks are best-case; a consumer RTX 4090 will be slower but still dramatically faster than vanilla Whisper.

On Mac, --device-id mps works but the MPS backend isn’t fully optimized — expect good but not A100-level performance.

Speaker diarization requires a HuggingFace token (free) and depends on pyannote/speaker-diarization, which works well but isn’t perfect on heavily overlapping speech.

The Bottom Line

The transcription API market exists because running Whisper at scale was hard. Flash Attention 2 and smart batching just made it easy. Insanely Fast Whisper packages that into one CLI command.

If you have a GPU and you’re still paying per minute for transcription, the math stopped making sense.

GitHub: Vaibhavs10/insanely-fast-whisper


Frequently Asked Questions

What is Insanely Fast Whisper? Insanely Fast Whisper is an open-source CLI tool that transcribes audio using OpenAI’s Whisper model locally — no cloud API required. It uses Flash Attention 2 and fp16 batching to achieve up to a 19x speedup over standard Whisper, transcribing 150 minutes of audio in 98 seconds on an NVIDIA A100.

How much does Insanely Fast Whisper cost? Zero. It’s MIT-licensed open source software that runs on your own hardware. You pay nothing per minute, no subscription, no API key. The only cost is your own GPU compute, which you likely already own.

Does Insanely Fast Whisper work on Mac? Yes. It supports Apple Silicon (M1/M2/M3) via the --device-id mps flag. Performance is good but not as fast as NVIDIA A100-class hardware. Flash Attention 2 is not available on MPS, but standard fp16 batching still gives significant speedups.

Can Insanely Fast Whisper identify different speakers? Yes — it supports speaker diarization via pyannote/speaker-diarization. You need a free HuggingFace token to access the diarization model. Output labels each segment by speaker, not just by timestamp.

How does Insanely Fast Whisper compare to OpenAI Whisper API pricing? OpenAI charges $0.006/minute. For 150 minutes of audio that’s $0.90. Google Speech-to-Text charges $0.024/minute — $3.60 for the same audio. Insanely Fast Whisper costs $0 per minute. At any meaningful transcription volume, the savings are substantial.

What GPU do I need to run Insanely Fast Whisper with Flash Attention 2? Flash Attention 2 requires an NVIDIA GPU with Ampere architecture or newer — RTX 3080/3090/4090, A100, H100. Older GPUs (Turing, Pascal) can run Insanely Fast Whisper but without Flash Attention 2, falling back to BetterTransformer for optimization.