Vapi's Humanness Index: The Open Benchmark for Voice AI That Actually Sounds Human
Picking a TTS model for voice agents has always been a guessing game. Vendor demos sound great, but how do you know which model will actually fool your users into thinking they’re talking to a human?
Vapi just released an answer: The Humanness Index — an open, crowdsourced benchmark that ranks voice models on a single question: which one sounds more human?
What Makes This Different
Most voice benchmarks measure technical metrics such as mean opinion score (MOS), word error rate, and latency. The Humanness Index measures something simpler: can listeners tell if it’s AI?
Here’s the methodology:
- Same voice, every model: Vapi clones one conversational voice onto every model being tested. This eliminates the “cherry-picked demo voice” problem.
- Blind A/B battles: Listeners hear two voices reading the same customer support line. No labels, no hints. Pick whichever sounds more human.
- Human baseline at 100: A real person reads the same lines, anchoring the scale. Every AI model scores as a percentage of that human benchmark.
- Elo ranking: Votes feed into an Elo system (like chess rankings), so scores reflect head-to-head performance, not just thumbs-up counts.
The result: a single number that tells you “this model sounds X% as human as an actual person.”
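To make the Elo step concrete, here is a minimal sketch of how blind A/B votes could be turned into ratings. The K-factor, the starting rating, and especially the `relative_to_human` rescaling are assumptions for illustration; the post does not say how the benchmark maps Elo ratings onto its 0-100 scale, so check the repo's Elo engine for the real values.

```python
from collections import defaultdict

K = 32          # assumed K-factor; the benchmark's actual value is not stated here
START = 1000.0  # assumed starting rating for every voice


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def apply_votes(votes, k=K, start=START):
    """Replay (winner, loser) pairs from blind A/B battles in order."""
    ratings = defaultdict(lambda: start)
    for winner, loser in votes:
        e_win = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - e_win)  # an upset win moves ratings more than an expected one
        ratings[winner] += delta   # the winner gains exactly what the loser gives up
        ratings[loser] -= delta
    return dict(ratings)


def relative_to_human(ratings, human="human"):
    """Hypothetical rescaling: each rating as a percentage of the human's rating."""
    return {name: round(100.0 * r / ratings[human], 1) for name, r in ratings.items()}


# Hypothetical votes, not real benchmark data.
votes = [("human", "model_a"), ("human", "model_b"), ("model_a", "model_b")]
ratings = apply_votes(votes)
scores = relative_to_human(ratings)
```

Because each vote moves both ratings by the same amount, beating a strong voice counts for more than beating a weak one, which is what separates this from a raw win count.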
Current Rankings (June 2026)
Based on 721 votes across 16 models and 7 providers (the top eight are shown here):
| Rank | Model | Humanness | Latency | Cost |
|---|---|---|---|---|
| Baseline | Human | 100 | — | — |
| #1 | xAI Grok TTS | 90 | 460ms | $15/1M chars |
| #2 | xAI Grok TTS (Streaming) | 86 | 285ms | $15/1M chars |
| #3 | Cartesia Sonic 3.5 | 74 | 128ms | $50/1M chars |
| #3 | Canopy Orpheus | 74 | — | Open source |
| #5 | ElevenLabs Turbo v2.5 | 66 | 265ms | $50/1M chars |
| #6 | ElevenLabs Eleven v3 | 66 | 758ms | $100/1M chars |
| #7 | ElevenLabs Flash v2.5 | 64 | 197ms | $50/1M chars |
| #8 | MiniMax Speech 2.5 | 63 | 325ms | $60/1M chars |
Key takeaways:
- xAI’s Grok TTS is the current humanness leader at 90/100
- Cartesia Sonic 3.5 has the lowest listed latency (128ms) and a humanness score of 74
- Canopy Orpheus matches Cartesia’s humanness and is open source
- ElevenLabs models cluster in the 64-66 range — solid but not leading
Why Latency Matters
The benchmark also measures time-to-first-audio. A voice that lags breaks the conversation, no matter how human it sounds.
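If you want to measure time-to-first-audio against your own setup, a rough sketch looks like this. It assumes your provider's client exposes a stream that yields audio chunks; `fake_tts_stream` is a hypothetical stand-in, and the median and p95 summary is one reasonable choice rather than the repo's actual reporting.

```python
import statistics
import time
from typing import Callable, Iterable


def time_to_first_chunk(start_stream: Callable[[], Iterable[bytes]]) -> float:
    """Seconds from starting the request until the first non-empty chunk arrives."""
    t0 = time.perf_counter()
    for chunk in start_stream():
        if chunk:
            return time.perf_counter() - t0
    raise RuntimeError("stream ended before any audio arrived")


def benchmark_ttfb(start_stream, trials: int = 50) -> dict:
    """Run repeated trials and summarize the results in milliseconds."""
    samples = sorted(time_to_first_chunk(start_stream) * 1000.0 for _ in range(trials))
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "trials": trials,
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


def fake_tts_stream():
    """Hypothetical stand-in for a streaming TTS call; swap in a real client."""
    time.sleep(0.01)       # simulated delay before the first audio chunk
    yield b"\x00" * 320
    yield b"\x00" * 320


result = benchmark_ttfb(fake_tts_stream, trials=5)
```

Run it from the same region as your production agent, since network distance to the provider is part of what your callers actually hear.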
The sweet spot is top-right on their chart: human-sounding AND fast. Right now that’s:
- Cartesia Sonic 3.5: 74 humanness, 128ms latency
- ElevenLabs Flash v2.5: 64 humanness, 197ms latency
- xAI Grok Streaming: 86 humanness, 285ms latency
If you’re building real-time voice agents, aim for time-to-first-audio under about 300ms to keep the conversation flowing naturally.
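A latency budget like this is easy to apply to the rankings table. The sketch below copies the humanness and latency figures from the table above, keeps the models that fit a budget, and sorts them by humanness. The function name and the 300ms default are illustrative choices.

```python
# (model, humanness, latency_ms) copied from the rankings table above.
# Canopy Orpheus is omitted because the table lists no latency for it.
MODELS = [
    ("xAI Grok TTS", 90, 460),
    ("xAI Grok TTS (Streaming)", 86, 285),
    ("Cartesia Sonic 3.5", 74, 128),
    ("ElevenLabs Turbo v2.5", 66, 265),
    ("ElevenLabs Eleven v3", 66, 758),
    ("ElevenLabs Flash v2.5", 64, 197),
    ("MiniMax Speech 2.5", 63, 325),
]


def within_budget(models, max_latency_ms=300):
    """Models at or under the latency budget, most human first, faster breaks ties."""
    fits = [m for m in models if m[2] <= max_latency_ms]
    return sorted(fits, key=lambda m: (-m[1], m[2]))


for name, humanness, latency_ms in within_budget(MODELS):
    print(f"{name}: humanness {humanness}, {latency_ms}ms")
```

Tightening the budget to 150ms leaves only Cartesia Sonic 3.5, which shows how quickly the shortlist shrinks as latency requirements get stricter.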
How to Use This Practically
For voice agent developers:
- Start with the leaderboard: Before committing to a provider, check current rankings at humannessindex.vapi.ai
- Vote yourself: The benchmark lets you participate in blind A/B tests. After 10-20 comparisons, you’ll develop your own intuition for what “human” sounds like
- Consider the tradeoffs:
  - Need maximum humanness? → xAI Grok TTS
  - Need speed + quality? → Cartesia Sonic 3.5
  - Budget-conscious + open source? → Canopy Orpheus
  - Already on ElevenLabs? → Turbo v2.5 is your best option there
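Cost belongs in the tradeoff too. As a back-of-the-envelope sketch, the per-character prices from the rankings table can be turned into a monthly estimate. The call volume and characters per call below are hypothetical; substitute your own traffic. Canopy Orpheus is left out because the table gives no per-character price for it, and self-hosting has its own costs.

```python
# USD per 1M characters, copied from the rankings table.
PRICE_PER_MILLION_CHARS = {
    "xAI Grok TTS": 15,
    "Cartesia Sonic 3.5": 50,
    "ElevenLabs Turbo v2.5": 50,
    "ElevenLabs Eleven v3": 100,
    "ElevenLabs Flash v2.5": 50,
    "MiniMax Speech 2.5": 60,
}


def monthly_cost_usd(model: str, calls_per_month: int, chars_per_call: int) -> float:
    """Estimated monthly TTS spend for a given synthesized-character volume."""
    total_chars = calls_per_month * chars_per_call
    return PRICE_PER_MILLION_CHARS[model] * total_chars / 1_000_000


# Hypothetical workload: 10,000 calls a month, 1,500 synthesized characters each.
for model in PRICE_PER_MILLION_CHARS:
    print(f"{model}: ${monthly_cost_usd(model, 10_000, 1_500):,.2f}")
```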
For model providers:
The benchmark only includes models that support voice cloning (necessary for fair comparison). If your model isn’t listed, contact humannessindex@vapi.ai.
The Open Source Angle
Everything is on GitHub: VapiAI/humanness-index
The repo includes:
- The full Next.js site
- Vote backend and Elo engine
- Model registry
- Clip generation pipeline
- 50-trial TTFB benchmark
You can run it locally with Bun:
bun install
bun run dev # http://localhost:3000
Vote data and standings are licensed under CC BY 4.0, so you can build on top of the rankings in your own projects as long as you give attribution.
What This Means for Voice AI
The Humanness Index solves a real problem: vendor claims are unreliable, and technical benchmarks don’t capture “does it fool humans?”
For the first time, we have an objective (well, crowdsourced-subjective) answer to “which TTS sounds most human” — updated live as votes come in.
If you’re building voice agents, bookmark the leaderboard. If you’re evaluating TTS providers, run your own blind tests using their methodology. And if you care about advancing the field, cast some votes.
Links:
- Live benchmark: humannessindex.vapi.ai
- GitHub: VapiAI/humanness-index
- Whitepaper: humannessindex.vapi.ai/the-humanness-index-whitepaper.pdf