fast-vad: 721x Realtime Voice Activity Detection Without a Neural Network

By Prahlad Menon · 2 min read

The default assumption in modern voice AI is: when in doubt, use a neural network. Silero VAD is the go-to — a small LSTM trained on a huge multilingual dataset, ONNX runtime, decent accuracy. It works. But for a task as well-defined as “is someone speaking right now?”, a neural network is often overkill.

fast-vad makes the opposite bet: logistic regression, hardcoded weights, SIMD-accelerated DSP, zero model loading. The result is 721x realtime throughput in offline mode and 129x realtime in streaming — while landing within striking distance of Silero on accuracy.


What fast-vad Is

fast-vad is a Rust crate (with Python bindings via PyO3/maturin) for voice activity detection. It supports 16 kHz and 8 kHz audio, processes fixed 32ms frames, and ships as a pip install or cargo add.

The design is deliberately minimal:

audio
  → 32ms frames (512 samples @ 16kHz)
  → Hann window
  → real FFT
  → 8 log-energy bands
  → feature engineering (32 total features)
  → hardcoded logistic regression
  → threshold + temporal smoothing
  → speech / silence

No model file. No ONNX runtime. No dynamic weights. The logistic regression weights and bias are compiled directly into the crate. A fresh process is ready to classify speech in microseconds, not the hundreds of milliseconds it takes to load and warm up an ONNX model.
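The whole inference path fits in a few lines of NumPy. This is an illustrative reconstruction, not the crate's actual implementation: the weight vector, bias, and band split below are made-up placeholders, and the real model uses 32 features (with deltas and noise normalization), not just 8 raw bands.

```python
import numpy as np

FRAME = 512  # 32 ms at 16 kHz

# Placeholder parameters — the real crate compiles its trained
# weights and band layout directly into the binary.
rng = np.random.default_rng(0)
WEIGHTS = rng.normal(size=8)   # one weight per log-energy band (illustrative)
BIAS = -1.0                    # illustrative

def classify_frame(frame: np.ndarray, threshold: float = 0.5) -> bool:
    """Toy version of the fast-vad pipeline on a single 32 ms frame."""
    windowed = frame * np.hanning(FRAME)           # Hann window
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2  # real FFT -> power
    bands = np.array_split(spectrum, 8)            # 8 frequency bands
    log_energy = np.log(np.array([b.sum() for b in bands]) + 1e-10)
    logit = WEIGHTS @ log_energy + BIAS            # logistic regression
    prob = 1.0 / (1.0 + np.exp(-logit))            # sigmoid
    return prob > threshold

frame = rng.normal(scale=0.1, size=FRAME).astype(np.float32)
print(classify_frame(frame))
```

Because the "model" is just an 8-line function over two small constant arrays, there is nothing to load and nothing to warm up.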


The 32 Features

Each 32ms frame produces 32 engineered features from 8 frequency bands:

Feature group             | Count | What it captures
Raw band energies         |   8   | Absolute spectral energy per band
Noise-normalized energies |   8   | Energy relative to a running noise floor
First-order deltas        |   8   | Rate of change frame-to-frame
Second-order deltas       |   8   | Acceleration; helps catch onsets/offsets

The noise-normalized features are what give it robustness without a neural network. The noise floor is tracked as a running estimate, so the classifier adapts to background noise conditions rather than requiring a clean signal.
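A minimal sketch of how such a 32-feature vector could be assembled from the 8 per-band log energies. The adaptation rate and the asymmetric floor-update rule here are assumptions for illustration, not the crate's actual tracker:

```python
import numpy as np

ALPHA = 0.05  # noise-floor adaptation rate (assumed, not from the crate)

class FeatureExtractor:
    """Builds a 32-dim feature vector from 8 log band energies per frame."""
    def __init__(self):
        self.noise_floor = None
        self.prev = np.zeros(8)        # previous band energies
        self.prev_delta = np.zeros(8)  # previous first-order deltas

    def step(self, band_energies: np.ndarray) -> np.ndarray:
        if self.noise_floor is None:
            self.noise_floor = band_energies.copy()
        # Running noise-floor estimate: fast to fall, slow to rise,
        # so brief speech bursts don't pollute the floor.
        rising = band_energies > self.noise_floor
        self.noise_floor = np.where(
            rising,
            (1 - ALPHA) * self.noise_floor + ALPHA * band_energies,
            band_energies,
        )
        normalized = band_energies - self.noise_floor  # ratio, in log domain
        delta = band_energies - self.prev              # first-order delta
        delta2 = delta - self.prev_delta               # second-order delta
        self.prev, self.prev_delta = band_energies, delta
        return np.concatenate([band_energies, normalized, delta, delta2])

fx = FeatureExtractor()
feats = fx.step(np.zeros(8))
print(feats.shape)  # (32,)
```

The classifier then sees energy relative to the current floor, which is why a fixed set of weights can survive changing background-noise levels.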

All hot loops (windowing, FFT, band-energy math, feature computation) are SIMD-accelerated using the wide crate.


Trained on LibriVAD

The model was trained on LibriVAD — a dataset built from LibriSpeech recordings with silence segments derived from the MUSAN noise corpus. It’s a clean, well-structured dataset, which is both fast-vad’s strength and its honest caveat: noisy real-world audio (phone calls, street interviews, muffled microphones) will behave differently than the benchmark charts suggest.


The Benchmarks

fast-vad was benchmarked against two widely used baselines: Silero VAD (ONNX runtime path) and webrtcvad.

Throughput (the headline)

System               | Throughput
fast-vad (offline)   | 721x realtime
fast-vad (streaming) | 129x realtime
webrtcvad            | 63x realtime
Silero VAD (ONNX)    | 1x (baseline)

The offline path gets to 721x because it uses Rayon for parallel frame processing across available cores. The streaming path runs one frame at a time (reusing scratch buffers, no thread pool overhead) and still hits 129x — about 2x faster than WebRTC VAD.
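A realtime factor is just audio duration divided by wall-clock processing time, which makes it easy to measure on your own hardware. A quick harness (the dummy processor below is a stand-in; swap in the real `process` call):

```python
import time
import numpy as np

def realtime_factor(process, audio: np.ndarray, sample_rate: int) -> float:
    """Seconds of audio processed per wall-clock second."""
    t0 = time.perf_counter()
    process(audio)
    elapsed = time.perf_counter() - t0
    return (len(audio) / sample_rate) / elapsed

sr = 16_000
audio = np.zeros(sr * 10, dtype=np.float32)  # 10 s of silence
# Dummy workload standing in for e.g. fast_vad.VAD(sr).process
rtf = realtime_factor(lambda a: np.abs(a).sum(), audio, sr)
print(f"{rtf:.0f}x realtime")
```

At 721x, an hour of audio takes about five seconds of CPU time; at 1x it takes an hour.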

The gap between fast-vad and Silero isn’t incremental. It’s a different cost class entirely.

Accuracy (the honest picture)

AVA-Speech (real-world, messy audio):

System     | F1    | Precision | Recall
Silero VAD | 0.738 | 0.803     | 0.712
fast-vad   | 0.680 | 0.659     | 0.785
webrtcvad  | 0.669 | 0.619     | 0.825

fast-vad doesn’t win on F1 — Silero has a meaningful edge (0.738 vs 0.680) on messy real-world audio. But fast-vad’s recall (0.785) is significantly higher than Silero’s (0.712), which matters if you’re building a pipeline where missed speech is worse than a false positive.

On the LibriVAD test set (cleaner, matched to training data), fast-vad performs markedly better. The README is honest about this: “these charts are reference points, not guarantees for your own production data.”


Two Modes: Offline and Streaming

import fast_vad
import soundfile as sf

audio, sr = sf.read("audio.wav", dtype="float32")

# Offline — parallel frame processing, max throughput
vad = fast_vad.VAD(sr)
labels = vad.process(audio)  # returns frame-level speech/silence

# Streaming — one frame at a time, low latency
frame_len = 512  # 32 ms at 16 kHz (use 256 at 8 kHz)
vad_streaming = fast_vad.VadStateful(sr)
for start in range(0, len(audio) - frame_len + 1, frame_len):
    label = vad_streaming.process_frame(audio[start:start + frame_len])

Four preset modes (Normal, LowBitrate, Aggressive, VeryAggressive) tune the precision/recall tradeoff via the threshold and temporal smoothing parameters. Custom config is also supported.
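Temporal smoothing of this kind is typically a hysteresis (onset/hangover) scheme over the raw per-frame decisions. The exact rule and parameter values fast-vad uses aren't documented here, so this is a generic sketch with assumed numbers:

```python
class Smoother:
    """Hysteresis over per-frame decisions: require `onset` consecutive
    speech frames to enter speech, and `hangover` consecutive silence
    frames to leave it. Parameter values are illustrative only."""
    def __init__(self, onset: int = 2, hangover: int = 8):
        self.onset, self.hangover = onset, hangover
        self.in_speech = False
        self.run = 0  # consecutive frames disagreeing with current state

    def step(self, raw: bool) -> bool:
        if raw != self.in_speech:
            self.run += 1
            limit = self.onset if raw else self.hangover
            if self.run >= limit:
                self.in_speech = raw
                self.run = 0
        else:
            self.run = 0
        return self.in_speech

s = Smoother()
raw = [True, True, True, False, True, True, False, False]
print([s.step(r) for r in raw])
# [False, True, True, True, True, True, True, True]
```

Lowering the onset count and hangover trades precision for recall, which is the knob the Aggressive presets turn.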


When to Use It

Good fit:

  • Real-time speech pipelines where CPU cost matters (edge devices, high-throughput servers)
  • Preprocessing at scale — if you’re classifying hours of audio, 721x throughput vs 1x is a real cost difference
  • Voice agent interruption detection — fast enough to catch the start of speech before the user finishes a word
  • Anywhere you want zero model loading time (serverless, cold starts, embedded)

Less ideal:

  • Highly noisy environments where Silero’s neural robustness clearly wins
  • Non-LibriSpeech-style audio where the training distribution is distant from your use case
  • When you need the best possible F1 and have the CPU budget to pay for Silero

Relevance to Voice AI Stacks

In a modern voice agent stack — Whisper or Canary for STT, a local LLM, Kokoro for TTS, LiveKit for transport — VAD is the first gate. Everything downstream depends on correctly identifying when the user is speaking.

fast-vad fits naturally as the front-end VAD in that pipeline: zero startup cost, low CPU overhead, good recall (fewer missed speech segments), and a streaming mode that runs comfortably within a 32ms frame budget. For real-time agent interruption detection specifically, the latency profile is hard to beat.

The Rust crate also integrates cleanly with other Rust audio components or FFI boundaries, and the Python bindings mean you can drop it into a Python voice agent without rewriting anything.


Install

# Python
pip install fast-vad

# Rust
cargo add fast-vad

Repo: github.com/AtharvBhat/fast-vad


The broader lesson here is one the ML community keeps needing to relearn: the right tool for a constrained, well-defined problem is often not a neural network. VAD is a binary classification task on structured spectral features. A logistic regression trained on a decent dataset, combined with good feature engineering and temporal smoothing, gets you most of the way there at a fraction of the cost.
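To make the principle concrete: training such a model is a few lines of plain gradient descent. The data below is synthetic — in practice X would be the 32-dim spectral features per frame and y the LibriVAD speech/silence labels — and this is not fast-vad's actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))
y = (X[:, :8].sum(axis=1) > 0).astype(float)  # synthetic separable labels

# Plain gradient-descent logistic regression — no framework needed.
w, b = np.zeros(32), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    w -= 0.5 * (X.T @ (p - y) / len(y))     # gradient step on weights
    b -= 0.5 * (p - y).mean()               # gradient step on bias

acc = ((1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
print(f"train accuracy: {acc:.2f}")
# The fitted w and b are two small constant arrays — trivial to hardcode.
```

The output of training is 33 numbers. That is the entire "model file" problem solved by a copy-paste into source code.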

fast-vad is a well-executed example of that principle.