FFmpeg's First AI Filter Delivers On-Device Whisper Transcription

FFmpeg, the open-source Swiss Army knife of multimedia processing, now speaks. A freshly merged whisper audio filter brings OpenAI’s Whisper model directly into FFmpeg’s filter graph, enabling single-command automatic speech recognition (ASR) without any third-party cloud service. This is FFmpeg’s first formal AI integration, and it turns the decades-old toolkit into a privacy-respecting transcription engine that can output plain text, SRT subtitles, or structured JSON.

Why this matters: FFmpeg meets AI

For years, transcribing audio meant extracting a stream, piping it to a separate tool like whisper.cpp, and then merging the result back into a workflow. The new af_whisper filter collapses all that into one step. It lives inside libavfilter, the same machinery that handles cropping, scaling, and noise reduction. Developers can now write ffmpeg -i input.mp4 -af "aformat=sample_rates=16000:channel_layouts=mono,whisper=model=/path/ggml-base.en.bin:language=en:format=srt:destination=output.srt" -f null - and get subtitles generated on the fly.

The integration is more than a convenience. It signals that FFmpeg—long the foundation of video pipelines from desktop apps to cloud transcoding farms—is now a platform for inline machine learning. Future filters could handle speaker diarization, real-time translation, or noise suppression using the same pattern. The whisper filter was merged ahead of the planned FFmpeg 8.0 release, and it uses the well-known whisper.cpp runtime under the hood.

Under the hood: How the whisper filter operates

At its core, the filter expects audio in a very specific shape: a 16 kHz sample rate and a single (mono) channel. That requirement comes from Whisper itself, whose models were trained on audio in that configuration. A typical chain therefore begins with aformat to resample and downmix. After that, the filter loads a ggml-format model file, processes the audio in configurable chunks, and emits the transcription as frame metadata (lavfi.whisper.text and lavfi.whisper.duration). Where that text ends up depends on the destination option: it can be a file path or an HTTP endpoint, or the text can simply travel with the audio frames as metadata for downstream processing.
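One way to inspect that frame metadata is to append FFmpeg's ametadata filter in print mode after whisper and read the resulting dump. The parser below is a minimal sketch under assumptions: it expects each frame to be introduced by a header line of the form frame:N pts:N pts_time:N followed by key=value lines, which is how ametadata's print mode is commonly seen to write them. Check the layout against your own build. The function name is hypothetical.

```python
from __future__ import annotations


def parse_metadata_dump(text: str) -> list[dict[str, str]]:
    """Collect frames that carry a lavfi.whisper.text entry.

    Assumed layout: a "frame:N pts:N pts_time:N" header line,
    followed by zero or more "key=value" lines for that frame.
    """
    records: list[dict[str, str]] = []
    current: dict[str, str] | None = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("frame:"):
            # Start a new record from the whitespace-separated header fields.
            current = {}
            for field in line.split():
                key, _, value = field.partition(":")
                current[key] = value
            records.append(current)
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key] = value
    # Frames without a transcript are not interesting here.
    return [r for r in records if "lavfi.whisper.text" in r]
```

Each returned dictionary holds the header fields (frame, pts, pts_time) alongside the whisper keys, so a caller can pair text with its timestamp.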

The filter exposes a rich set of options:

model: path to the whisper.cpp model file (required).
language: language code or auto for automatic detection.
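queue: maximum amount of audio buffered before each transcription pass, given as a duration (read as seconds unless a unit suffix such as ms is appended); larger queues improve accuracy but add latency.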
use_gpu / gpu_device: toggle GPU acceleration and select the device.
format: text, srt, or json for output.
destination: AVIO-style URL where the transcript should be written.
vad_model, vad_threshold, vad_min_speech_duration, vad_min_silence_duration: voice activity detection parameters that let the filter split the queue intelligently, avoiding wasted inference on silence.
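Because option values such as URLs contain colons, assembling the filter string by hand is error-prone. The helper below is a sketch, not part of FFmpeg: it applies the two escaping levels described in FFmpeg's filtergraph syntax notes (first the option value, then the whole filter description) and prepends the aformat stage. The function names are hypothetical, and the option names are the ones listed above.

```python
import re


def escape_option_value(value: str) -> str:
    """First level: protect ':', '\\' and "'" inside one option value."""
    return re.sub(r"([\\':])", r"\\\1", value)


def escape_filter_description(description: str) -> str:
    """Second level: protect '\\', "'", '[', ']', ',' and ';' in a filter."""
    return re.sub(r"([\\'\[\],;])", r"\\\1", description)


def build_whisper_chain(model: str, **options: object) -> str:
    """Return 'aformat=...,whisper=...' with option values escaped."""
    pairs = [("model", model), *options.items()]
    args = ":".join(
        f"{key}={escape_option_value(str(value))}" for key, value in pairs
    )
    whisper = escape_filter_description("whisper=" + args)
    # The aformat stage has no special characters, so it is joined as-is.
    return "aformat=sample_rates=16000:channel_layouts=mono," + whisper
```

The result is meant to be passed as a single argument (for example as one element of a subprocess argument list), so no additional shell quoting layer is involved.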

Those last VAD settings are critical for live scenarios. Without them, the filter would be forced to transcribe every chunk of audio whether someone is speaking or not, burning CPU or GPU cycles for no reason.
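To make the roles of the threshold and the two minimum durations concrete, here is a toy segmenter. It is not the filter's or whisper.cpp's implementation: it assumes some VAD model has already produced one speech probability per fixed-length frame, and all names and parameters are illustrative.

```python
def speech_segments(probs, frame_ms, threshold, min_speech_ms, min_silence_ms):
    """Turn per-frame speech probabilities into (start_ms, end_ms) segments.

    A segment opens at the first frame at or above the threshold and closes
    once silence has lasted min_silence_ms. Segments shorter than
    min_speech_ms are dropped as noise.
    """
    segments = []
    start = None      # start time of the segment being built, if any
    silence_run = 0   # consecutive silence (ms) seen inside that segment
    for index, prob in enumerate(probs):
        now = index * frame_ms
        if prob >= threshold:
            if start is None:
                start = now
            silence_run = 0
        elif start is not None:
            silence_run += frame_ms
            if silence_run >= min_silence_ms:
                end = now + frame_ms - silence_run  # end of last speech frame
                if end - start >= min_speech_ms:
                    segments.append((start, end))
                start, silence_run = None, 0
    if start is not None:
        # Input ended while a segment was still open.
        end = len(probs) * frame_ms - silence_run
        if end - start >= min_speech_ms:
            segments.append((start, end))
    return segments
```

Only the returned segments would need to be handed to the recognizer; everything outside them is skipped, which is where the savings on silent audio come from.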

Building it yourself: a practical checklist

The whisper filter is available in the development tree and will be part of FFmpeg 8.0 when that releases. If you want it today, you’ll need to compile from source. Here’s the rough path:

Install whisper.cpp – clone the repository, download a ggml-converted model (e.g., ggml-base.en.bin), then build and install the library. GPU backends (CUDA, Vulkan, Metal) require extra flags during cmake.
Configure FFmpeg with --enable-whisper and ensure the compiler can find whisper.cpp’s include and lib directories.
Verify the filter: ffmpeg --help filter=whisper should list all options.
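The verification step can also be scripted, which is handy in CI or container builds. The sketch below checks the text printed by ffmpeg -filters for a given filter name. It assumes each filter line has the layout "flags name inputs->outputs description"; the function names are hypothetical, and ffmpeg_has_filter additionally assumes an ffmpeg binary on the PATH.

```python
import subprocess


def listing_has_filter(listing: str, name: str) -> bool:
    """Return True if a filter line in the listing names the given filter."""
    for line in listing.splitlines():
        parts = line.split()
        # Filter lines look like: " ... whisper  A->A  description"
        if len(parts) >= 3 and parts[1] == name and "->" in parts[2]:
            return True
    return False


def ffmpeg_has_filter(name: str, binary: str = "ffmpeg") -> bool:
    """Run 'ffmpeg -filters' and look for the filter (needs ffmpeg installed)."""
    result = subprocess.run(
        [binary, "-hide_banner", "-filters"],
        capture_output=True, text=True, check=True,
    )
    return listing_has_filter(result.stdout, name)
```

Calling ffmpeg_has_filter("whisper") after the build gives a quick yes/no answer before any pipeline depends on the filter.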

There is no official Windows binary yet that includes whisper support. The most straightforward way for Windows users is to build inside WSL2 or a Linux container, where GPU acceleration can also be leveraged if the host has an NVIDIA GPU and the appropriate drivers are installed. Native Windows compilation is theoretically possible but currently undocumented; community experimenters have reported success with MSYS2 and careful library management, but proceed with patience.

One-command transcription examples

Once built, the filter slots into standard FFmpeg pipelines. Here are a few patterns:

Batch SRT generation

ffmpeg -i lecture.mp4 -vn -af "aformat=sample_rates=16000:channel_layouts=mono,whisper=model=ggml-medium.en.bin:language=en:format=srt:destination=lecture.srt" -f null -

Live captioning with VAD

ffmpeg -i rtmp://source/live -af "aformat=sample_rates=16000:channel_layouts=mono,whisper=model=ggml-small.bin:language=auto:queue=500ms:format=json:destination='http\://localhost\:8080':vad_model=path/to/silero:vad_threshold=0.4" -c:v copy -f flv rtmp://restream

Exporting JSON metadata

ffmpeg -i podcast.mp3 -af "whisper=model=ggml-tiny.bin:format=json:destination=transcript.json" -f null -

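In the batch and JSON examples, -f null - tells FFmpeg to discard the transcoded output because we only care about the sidecar transcript. The live example instead restreams the media while the transcript goes to its own destination.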
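For an archive of files, the same batch pattern can be generated per file. The following sketch only builds the argument lists and does not run them, so they can be inspected first. The function names are hypothetical, and it assumes file paths contain no characters that are special in a filtergraph (colons, commas, quotes, backslashes).

```python
from pathlib import Path


def srt_command(source: Path, model: str, language: str = "en") -> list[str]:
    """Build the argument list that writes <source>.srt next to the input."""
    target = source.with_suffix(".srt")
    chain = (
        "aformat=sample_rates=16000:channel_layouts=mono,"
        f"whisper=model={model}:language={language}"
        f":format=srt:destination={target}"
    )
    return ["ffmpeg", "-i", str(source), "-vn", "-af", chain, "-f", "null", "-"]


def batch_commands(folder: Path, model: str, pattern: str = "*.mp3") -> list[list[str]]:
    """One command per matching file, in sorted order for repeatable runs."""
    return [srt_command(path, model) for path in sorted(folder.glob(pattern))]
```

Each list can be handed to subprocess.run(command, check=True) once the output looks right; passing a list rather than a string avoids a shell quoting layer.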

Performance, latency, and model choices

Whisper models come in several sizes, and the filter inherits their speed/accuracy trade-offs:

Model	Parameters	Approx. VRAM (GPU)	Relative Speed	Use Case
Tiny	39 M	~150 MB	Fastest	Low-latency live captions
Base	74 M	~280 MB	Fast	Basic transcription
Small	244 M	~1 GB	Moderate	Podcasts, meetings
Medium	769 M	~3 GB	Slow	High-accuracy offline
Large	1550 M	~6 GB	Very slow	Studio-level archiving
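The table can be turned into a simple selection rule: pick the largest model whose approximate footprint fits the memory you can spare. This is a sketch that uses the rough VRAM figures from the table above (with "~1 GB" read as 1000 MB, and so on); real usage depends on backend and quantization, and the function name is hypothetical.

```python
# (model name, approximate VRAM in MB), ordered from smallest to largest.
MODEL_VRAM_MB = [
    ("tiny", 150),
    ("base", 280),
    ("small", 1000),
    ("medium", 3000),
    ("large", 6000),
]


def largest_model_that_fits(available_mb: float):
    """Return the biggest model within the budget, or None if none fits."""
    choice = None
    for name, needed_mb in MODEL_VRAM_MB:
        if needed_mb <= available_mb:
            choice = name  # keep upgrading while the budget allows
    return choice
```

For example, a card with 2 GB free would land on the small model under these figures, leaving the medium model out of reach.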

For near-real-time use, stick with tiny or base models and enable GPU acceleration. The queue size then becomes the main dial: smaller queues reduce delay but can hurt word error rate because Whisper works best with sufficient context. A queue of 500–1000 ms often strikes a good balance, especially when VAD ensures the buffer contains actual speech.

Without a GPU, even a modern CPU will struggle to keep up with anything beyond the small model in real time. On a desktop with an RTX 3060, the medium model can run at about 1.5× real time, making it acceptable for batch processing but still too slow for interactive use.
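Speed figures quoted as multiples of real time are easy to misread, so it helps to fix a convention when measuring your own setup. The sketch below assumes one common definition: real-time factor equals processing time divided by audio duration, so values below 1.0 mean the transcriber keeps up with a live stream. The delay estimate is a deliberate simplification (fill the queue, then run inference on it), and the function names are hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 keeps up with live input."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def keeps_up(rtf: float) -> bool:
    """True when inference finishes before the next chunk has been recorded."""
    return rtf < 1.0


def worst_case_delay(queue_seconds: float, rtf: float) -> float:
    """Rough caption delay: time to fill the queue plus time to transcribe it."""
    return queue_seconds * (1.0 + rtf)
```

Timing one representative file and plugging the numbers in shows quickly whether a model and queue size are plausible for live captions or only for batch work.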

Real-world use cases

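Automated subtitles: Generate SRT files as part of a video transcoding pipeline; WebVTT needs a separate conversion step (for example ffmpeg -i captions.srt captions.vtt), because the filter’s format option covers only text, srt, and json.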
Podcast transcription: Batch-process an archive of MP3 files and produce searchable JSON transcripts.
Live captioning: Feed a low-latency stream through the filter and push text to a browser using WebSocket.
Metadata indexing: Attach time-stamped text to media files so that an internal CMS can enable full-text search.
Privacy-first compliance: Keep all audio data on-premises while still producing transcripts for audits or discovery.
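For the indexing and search use cases, the SRT output has to be turned into structured records. The following is a minimal sketch that parses standard SRT text (index line, timing line, text lines, blank line) into dictionaries with millisecond timestamps. It assumes the filter's SRT output follows that standard layout; the function names are hypothetical.

```python
import re

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_milliseconds(stamp: str) -> int:
    """Convert 'HH:MM:SS,mmm' into an integer number of milliseconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_srt(text: str) -> list:
    """Return one dict per cue: start_ms, end_ms and the joined text."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 2 or "-->" not in lines[1]:
            continue  # skip anything that is not a well-formed cue
        start, end = lines[1].split("-->")
        cues.append({
            "start_ms": to_milliseconds(start),
            "end_ms": to_milliseconds(end),
            "text": " ".join(line.strip() for line in lines[2:]),
        })
    return cues


def search(cues: list, term: str) -> list:
    """Case-insensitive substring search over cue text."""
    needle = term.lower()
    return [cue for cue in cues if needle in cue["text"].lower()]
```

The resulting records can be loaded into whatever search index a CMS already uses, with start_ms serving as the seek position for a matching phrase.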

Windows and cross-platform realities

The filter’s documentation and community recipes are almost exclusively Linux-centric. For Windows developers, three paths are viable today:

WSL2 with GPU passthrough: Build and run the entire pipeline inside Ubuntu under WSL. This gives access to NVIDIA CUDA drivers and is the path of least resistance.
Docker on Windows: Use a pre-built container image (when community ones appear) that bundles FFmpeg + whisper.cpp. This simplifies dependency hell but may add a slight virtualization overhead.
Native Windows build: Requires installing whisper.cpp via a package manager like vcpkg, then configuring FFmpeg with MSVC or MinGW. Library discovery is finicky, and GPU acceleration is even trickier. This is a frontier for enthusiasts; expect to document your journey if you attempt it.

As demand grows, expect third-party maintainers to release Windows binaries that include the whisper filter, much like the full FFmpeg builds for Windows that gyan.dev offers today.

Security and privacy

The on-device nature of the whisper filter is a double-edged sword. On one hand, sensitive audio never leaves your server or workstation. That’s a major advantage over cloud-based ASR services. On the other hand, transcripts themselves are still data that needs protection. If you write them to disk, ensure proper access controls. If you send them to a remote destination, encrypt the transport and authenticate the endpoint.
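When a script post-processes transcripts and stores them itself, restrictive file permissions are a cheap first layer of access control. The sketch below creates the file readable and writable by its owner only. It assumes a POSIX system (os.fchmod is not available on every platform), and the function name is hypothetical. It does not affect files that the filter writes directly through its destination option.

```python
import os


def write_private(path: str, text: str) -> None:
    """Write text to path with owner-only permissions (mode 0600)."""
    # Create with 0600 so the file is never briefly world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        # Enforce the mode even if the file already existed with looser bits.
        os.fchmod(handle.fileno(), 0o600)
        handle.write(text)
```

For transcripts written by FFmpeg itself, a comparable effect can be had by launching the process with a restrictive umask or by writing into a directory that only the service account can read.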

There’s also the risk of misuse. Lowering the barrier to mass transcription can enable bulk harvesting of spoken content from public streams. Administrators deploying the filter in a shared environment should add rate limiting and logging around the transcription endpoint.

Accuracy caveats: automatic transcripts often get names, numbers, and domain-specific jargon wrong. In legal or medical contexts, human verification remains essential. The filter does not expose confidence scores per word in its current form, so downstream quality assurance is largely binary.

Limitations and what’s next

No built-in speaker diarization: The filter transcribes a single mono channel. If your input has multiple speakers, you’ll need to separate them beforehand.
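Model licensing: OpenAI released the Whisper code and weights under the MIT license, which permits commercial use and redistribution, but converted or fine-tuned models from third parties may carry their own terms. FFmpeg does not ship models; you provide them yourself.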
Real-time ceiling: Even with GPU acceleration, the largest models are impractical for live use. The community will likely optimize the whisper.cpp backend further, but for now, set expectations accordingly.
Release timing: While merged, the feature is tied to the FFmpeg 8.0 release cycle. Build from Git if you can’t wait.

The precedent set by this filter is significant. FFmpeg’s maintainers have demonstrated that AI models can live comfortably inside the filter graph. Expect to see future proposals for audio enhancement, voice isolation, and even on-the-fly translation filters that follow the same integration pattern.

Conclusion

FFmpeg’s new whisper filter is a leap forward for the project, bringing on-device ASR into a tool that already touches millions of media workflows. It simplifies transcription pipelines, keeps data local, and paves the way for a new class of intelligent filters. For Windows users, the immediate path is through WSL or containers, but native support will likely follow as the community rallies around this feature. If you process audio at any scale, it’s time to pull the latest source and start experimenting.