fast-vad: 721x Realtime Voice Activity Detection Without a Neural Network

By Prahlad Menon · 2 min read

The default assumption in modern voice AI is: when in doubt, use a neural network. Silero VAD is the go-to — a small LSTM trained on a huge multilingual dataset, ONNX runtime, decent accuracy. It works. But for a task as well-defined as “is someone speaking right now?”, a neural network is often overkill.

fast-vad makes the opposite bet: logistic regression, hardcoded weights, SIMD-accelerated DSP, zero model loading. The result is 721x realtime throughput in offline mode and 129x realtime in streaming — while landing within striking distance of Silero on accuracy.


What fast-vad Is

fast-vad is a Rust crate (with Python bindings via PyO3/maturin) for voice activity detection. It supports 16 kHz and 8 kHz audio, processes fixed 32ms frames, and ships as a pip install or cargo add.

The design is deliberately minimal:

audio
  → 32ms frames (512 samples @ 16kHz)
  → Hann window
  → real FFT
  → 8 log-energy bands
  → feature engineering (32 total features)
  → hardcoded logistic regression
  → threshold + temporal smoothing
  → speech / silence

No model file. No ONNX runtime. No dynamic weights. The logistic regression weights and bias are compiled directly into the crate. A fresh process is ready to classify speech in microseconds, not the hundreds of milliseconds it takes to load and warm up an ONNX model.
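The whole inference path fits in a few lines of NumPy. This is an illustrative reconstruction, not the crate's actual implementation: the weight vector, bias, and band split below are made-up placeholders, and the real model uses 32 features (with deltas and noise normalization), not just 8 raw bands.

```python
import numpy as np

FRAME = 512  # 32 ms at 16 kHz

# Placeholder parameters — the real crate compiles its trained
# weights and band layout directly into the binary.
rng = np.random.default_rng(0)
WEIGHTS = rng.normal(size=8)   # one weight per log-energy band (illustrative)
BIAS = -1.0                    # illustrative

def classify_frame(frame: np.ndarray, threshold: float = 0.5) -> bool:
    """Toy version of the fast-vad pipeline on a single 32 ms frame."""
    windowed = frame * np.hanning(FRAME)           # Hann window
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2  # real FFT -> power
    bands = np.array_split(spectrum, 8)            # 8 frequency bands
    log_energy = np.log(np.array([b.sum() for b in bands]) + 1e-10)
    logit = WEIGHTS @ log_energy + BIAS            # logistic regression
    prob = 1.0 / (1.0 + np.exp(-logit))            # sigmoid
    return prob > threshold

frame = rng.normal(scale=0.1, size=FRAME).astype(np.float32)
print(classify_frame(frame))
```

Because the "model" is just an 8-line function over two small constant arrays, there is nothing to load and nothing to warm up.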


The 32 Features

Each 32ms frame produces 32 engineered features from 8 frequency bands:

Feature group             | Count | What it captures
Raw band energies         |   8   | Absolute spectral energy per band
Noise-normalized energies |   8   | Energy relative to a running noise floor
First-order deltas        |   8   | Rate of change frame-to-frame
Second-order deltas       |   8   | Acceleration; helps catch onsets/offsets

The noise-normalized features are what give it robustness without a neural network. The noise floor is tracked as a running estimate, so the classifier adapts to background noise conditions rather than requiring a clean signal.
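A minimal sketch of how such a 32-feature vector could be assembled from the 8 per-band log energies. The adaptation rate and the asymmetric floor-update rule here are assumptions for illustration, not the crate's actual tracker:

```python
import numpy as np

ALPHA = 0.05  # noise-floor adaptation rate (assumed, not from the crate)

class FeatureExtractor:
    """Builds a 32-dim feature vector from 8 log band energies per frame."""
    def __init__(self):
        self.noise_floor = None
        self.prev = np.zeros(8)        # previous band energies
        self.prev_delta = np.zeros(8)  # previous first-order deltas

    def step(self, band_energies: np.ndarray) -> np.ndarray:
        if self.noise_floor is None:
            self.noise_floor = band_energies.copy()
        # Running noise-floor estimate: fast to fall, slow to rise,
        # so brief speech bursts don't pollute the floor.
        rising = band_energies > self.noise_floor
        self.noise_floor = np.where(
            rising,
            (1 - ALPHA) * self.noise_floor + ALPHA * band_energies,
            band_energies,
        )
        normalized = band_energies - self.noise_floor  # ratio, in log domain
        delta = band_energies - self.prev              # first-order delta
        delta2 = delta - self.prev_delta               # second-order delta
        self.prev, self.prev_delta = band_energies, delta
        return np.concatenate([band_energies, normalized, delta, delta2])

fx = FeatureExtractor()
feats = fx.step(np.zeros(8))
print(feats.shape)  # (32,)
```

The classifier then sees energy relative to the current floor, which is why a fixed set of weights can survive changing background-noise levels.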

All hot loops (windowing, FFT, band-energy math, feature computation) are SIMD-accelerated using the wide crate.


Trained on LibriVAD

The model was trained on LibriVAD — a dataset built from LibriSpeech recordings with silence segments derived from the MUSAN noise corpus. It’s a clean, well-structured dataset, which is both fast-vad’s strength and its honest caveat: noisy real-world audio (phone calls, street interviews, muffled microphones) will behave differently than the benchmark charts suggest.


The Benchmarks

fast-vad was benchmarked against two widely used baselines: Silero VAD (ONNX runtime path) and webrtcvad.

Throughput (the headline)

System               | Throughput
fast-vad (offline)   | 721x realtime
fast-vad (streaming) | 129x realtime
webrtcvad            | 63x realtime
Silero VAD (ONNX)    | 1x (baseline)

The offline path gets to 721x because it uses Rayon for parallel frame processing across available cores. The streaming path runs one frame at a time (reusing scratch buffers, no thread pool overhead) and still hits 129x — about 2x faster than WebRTC VAD.
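A realtime factor is just audio duration divided by wall-clock processing time, which makes it easy to measure on your own hardware. A quick harness (the dummy processor below is a stand-in; swap in the real `process` call):

```python
import time
import numpy as np

def realtime_factor(process, audio: np.ndarray, sample_rate: int) -> float:
    """Seconds of audio processed per wall-clock second."""
    t0 = time.perf_counter()
    process(audio)
    elapsed = time.perf_counter() - t0
    return (len(audio) / sample_rate) / elapsed

sr = 16_000
audio = np.zeros(sr * 10, dtype=np.float32)  # 10 s of silence
# Dummy workload standing in for e.g. fast_vad.VAD(sr).process
rtf = realtime_factor(lambda a: np.abs(a).sum(), audio, sr)
print(f"{rtf:.0f}x realtime")
```

At 721x, an hour of audio takes about five seconds of CPU time; at 1x it takes an hour.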

The gap between fast-vad and Silero isn’t incremental. It’s a different cost class entirely.

Accuracy (the honest picture)

AVA-Speech (real-world, messy audio):

System     | F1    | Precision | Recall
Silero VAD | 0.738 | 0.803     | 0.712
fast-vad   | 0.680 | 0.659     | 0.785
webrtcvad  | 0.669 | 0.619     | 0.825

fast-vad doesn’t win on F1 — Silero has a meaningful edge (0.738 vs 0.680) on messy real-world audio. But fast-vad’s recall (0.785) is significantly higher than Silero’s (0.712), which matters if you’re building a pipeline where missed speech is worse than a false positive.

On the LibriVAD test set (cleaner, matched to training data), fast-vad performs markedly better. The README is honest about this: “these charts are reference points, not guarantees for your own production data.”


Two Modes: Offline and Streaming

import fast_vad
import soundfile as sf

audio, sr = sf.read("audio.wav", dtype="float32")

# Offline — parallel frame processing, max throughput
vad = fast_vad.VAD(sr)
labels = vad.process(audio)  # returns frame-level speech/silence

# Streaming — one frame at a time, low latency
frame_len = 512  # 32 ms at 16 kHz (use 256 at 8 kHz)
vad_streaming = fast_vad.VadStateful(sr)
for start in range(0, len(audio) - frame_len + 1, frame_len):
    label = vad_streaming.process_frame(audio[start:start + frame_len])

Four preset modes (Normal, LowBitrate, Aggressive, VeryAggressive) tune the precision/recall tradeoff via the threshold and temporal smoothing parameters. Custom config is also supported.
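Temporal smoothing of this kind is typically a hysteresis (onset/hangover) scheme over the raw per-frame decisions. The exact rule and parameter values fast-vad uses aren't documented here, so this is a generic sketch with assumed numbers:

```python
class Smoother:
    """Hysteresis over per-frame decisions: require `onset` consecutive
    speech frames to enter speech, and `hangover` consecutive silence
    frames to leave it. Parameter values are illustrative only."""
    def __init__(self, onset: int = 2, hangover: int = 8):
        self.onset, self.hangover = onset, hangover
        self.in_speech = False
        self.run = 0  # consecutive frames disagreeing with current state

    def step(self, raw: bool) -> bool:
        if raw != self.in_speech:
            self.run += 1
            limit = self.onset if raw else self.hangover
            if self.run >= limit:
                self.in_speech = raw
                self.run = 0
        else:
            self.run = 0
        return self.in_speech

s = Smoother()
raw = [True, True, True, False, True, True, False, False]
print([s.step(r) for r in raw])
# [False, True, True, True, True, True, True, True]
```

Lowering the onset count and hangover trades precision for recall, which is the knob the Aggressive presets turn.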


When to Use It

Good fit:

  • Real-time speech pipelines where CPU cost matters (edge devices, high-throughput servers)
  • Preprocessing at scale — if you’re classifying hours of audio, 721x throughput vs 1x is a real cost difference
  • Voice agent interruption detection — fast enough to catch the start of speech before the user finishes a word
  • Anywhere you want zero model loading time (serverless, cold starts, embedded)

Less ideal:

  • Highly noisy environments where Silero’s neural robustness clearly wins
  • Non-LibriSpeech-style audio where the training distribution is distant from your use case
  • When you need the best possible F1 and have the CPU budget to pay for Silero

Relevance to Voice AI Stacks

In a modern voice agent stack — Whisper or Canary for STT, a local LLM, Kokoro for TTS, LiveKit for transport — VAD is the first gate. Everything downstream depends on correctly identifying when the user is speaking.

fast-vad fits naturally as the front-end VAD in that pipeline: zero startup cost, low CPU overhead, good recall (fewer missed speech segments), and a streaming mode that runs comfortably within a 32ms frame budget. For real-time agent interruption detection specifically, the latency profile is hard to beat.

The Rust crate also integrates cleanly with other Rust audio components or FFI boundaries, and the Python bindings mean you can drop it into a Python voice agent without rewriting anything.


Install

# Python
pip install fast-vad

# Rust
cargo add fast-vad

Repo: github.com/AtharvBhat/fast-vad


The broader lesson here is one the ML community keeps needing to relearn: the right tool for a constrained, well-defined problem is often not a neural network. VAD is a binary classification task on structured spectral features. A logistic regression trained on a decent dataset, combined with good feature engineering and temporal smoothing, gets you most of the way there at a fraction of the cost.
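To make the principle concrete: training such a model is a few lines of plain gradient descent. The data below is synthetic — in practice X would be the 32-dim spectral features per frame and y the LibriVAD speech/silence labels — and this is not fast-vad's actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))
y = (X[:, :8].sum(axis=1) > 0).astype(float)  # synthetic separable labels

# Plain gradient-descent logistic regression — no framework needed.
w, b = np.zeros(32), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    w -= 0.5 * (X.T @ (p - y) / len(y))     # gradient step on weights
    b -= 0.5 * (p - y).mean()               # gradient step on bias

acc = ((1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
print(f"train accuracy: {acc:.2f}")
# The fitted w and b are two small constant arrays — trivial to hardcode.
```

The output of training is 33 numbers. That is the entire "model file" problem solved by a copy-paste into source code.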

fast-vad is a well-executed example of that principle.