Vapi's Humanness Index: The Open Benchmark for Voice AI That Actually Sounds Human

By Prahlad Menon

Picking a TTS model for voice agents has always been a guessing game. Vendor demos sound great, but how do you know which model will actually fool your users into thinking they’re talking to a human?

Vapi just released an answer: The Humanness Index — an open, crowdsourced benchmark that ranks voice models on a single question: which one sounds more human?

What Makes This Different

Most voice benchmarks report technical metrics such as mean opinion score (MOS), word error rate, and latency. The Humanness Index asks something simpler: can listeners tell it’s AI?

Here’s the methodology:

  1. Same voice, every model — They clone one conversational voice onto every model being tested. This eliminates the “cherry-picked demo voice” problem.

  2. Blind A/B battles — Listeners hear two voices reading the same customer support line. No labels, no hints. Pick whichever sounds more human.

  3. Human baseline at 100 — A real person reads the same lines, anchoring the scale. Every AI model scores as a percentage of that human benchmark.

  4. Elo ranking — Votes feed into an Elo system (like chess rankings), so scores reflect head-to-head performance, not just thumbs-up counts.

The result: a single number that tells you “this model sounds X% as human as an actual person.”
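
To make the Elo step concrete, here is a minimal Python sketch of how blind A/B votes could feed a rating table. It is not taken from the VapiAI/humanness-index repo: the K-factor of 32, the starting rating of 1500, the function names, and especially the linear rescaling that pins the human baseline at 100 are all assumptions for illustration. The article does not say how the index maps ratings onto its 0-100 scale.

```python
# Illustrative Elo sketch. K, START, and the rescaling are assumptions,
# not values or code from the Humanness Index repo.
K = 32          # assumed update step per vote
START = 1500.0  # assumed rating for a voice with no votes yet


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is picked over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def record_vote(ratings: dict[str, float], winner: str, loser: str) -> None:
    """Update both ratings in place after one blind A/B vote."""
    r_w = ratings.setdefault(winner, START)
    r_l = ratings.setdefault(loser, START)
    # An upset (low expected score for the winner) moves ratings further.
    surprise = 1.0 - expected_score(r_w, r_l)
    ratings[winner] = r_w + K * surprise
    ratings[loser] = r_l - K * surprise


def humanness_scores(ratings: dict[str, float], baseline: str = "human") -> dict[str, float]:
    """ASSUMED rescaling: each rating as a percentage of the baseline's rating."""
    base = ratings[baseline]
    return {name: round(100.0 * r / base, 1) for name, r in ratings.items()}


ratings: dict[str, float] = {}
record_vote(ratings, winner="human", loser="model_a")
record_vote(ratings, winner="model_a", loser="model_b")
print(humanness_scores(ratings))
```

The point of Elo over raw win counts is visible in `record_vote`: beating a highly rated voice earns more than beating a weak one, so a model cannot climb the board just by being paired against easy opponents.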

Current Rankings (June 2026)

Based on 721 votes across 16 models and 7 providers (top eight shown):

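| Rank     | Model                    | Humanness | Latency | Cost           |
|----------|--------------------------|-----------|---------|----------------|
| Baseline | Human                    | 100       |         |                |
| #1       | xAI Grok TTS             | 90        | 460ms   | $15/1M chars   |
| #2       | xAI Grok TTS (Streaming) | 86        | 285ms   | $15/1M chars   |
| #3       | Cartesia Sonic 3.5       | 74        | 128ms   | $50/1M chars   |
| #3       | Canopy Orpheus           | 74        |         | Open source    |
| #5       | ElevenLabs Turbo v2.5    | 66        | 265ms   | $50/1M chars   |
| #6       | ElevenLabs Eleven v3     | 66        | 758ms   | $100/1M chars  |
| #7       | ElevenLabs Flash v2.5    | 64        | 197ms   | $50/1M chars   |
| #8       | MiniMax Speech 2.5       | 63        | 325ms   | $60/1M chars   |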

Key takeaways:

  • xAI’s Grok TTS is the current humanness leader at 90/100
  • Cartesia Sonic 3.5 has the lowest latency of the models listed (128ms) with decent humanness (74)
  • Canopy Orpheus matches Cartesia’s humanness and is open source
  • ElevenLabs models cluster in the 64-66 range — solid but not leading
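
Because the leaderboard prices are quoted per million characters, it is easy to turn them into a rough monthly bill. The sketch below uses the prices from the rankings above; the 2,000,000 characters per month workload and the function name are hypothetical, and real invoices may differ (tiers, minimums, and so on are not covered by the article).

```python
# Prices in USD per 1M characters, copied from the rankings above.
PRICE_PER_MILLION_USD = {
    "xAI Grok TTS": 15,
    "Cartesia Sonic 3.5": 50,
    "ElevenLabs Turbo v2.5": 50,
    "ElevenLabs Eleven v3": 100,
    "ElevenLabs Flash v2.5": 50,
    "MiniMax Speech 2.5": 60,
}


def monthly_cost_usd(model: str, chars_per_month: int) -> float:
    """Flat per-character pricing: price per 1M chars times volume in millions."""
    return PRICE_PER_MILLION_USD[model] * chars_per_month / 1_000_000


# Hypothetical workload: 2,000,000 synthesized characters per month.
for name in PRICE_PER_MILLION_USD:
    print(f"{name}: ${monthly_cost_usd(name, 2_000_000):.2f}")
```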

Why Latency Matters

The benchmark also measures time-to-first-audio. A voice that lags breaks the conversation, no matter how human it sounds.

The sweet spot is top-right on their chart: human-sounding AND fast. Right now that’s:

  • Cartesia Sonic 3.5: 74 humanness, 128ms latency
  • ElevenLabs Flash v2.5: 64 humanness, 197ms latency
  • xAI Grok Streaming: 86 humanness, 285ms latency

If you’re building real-time voice agents, latency under 300ms is critical for natural conversation flow.
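
One way to apply that rule of thumb is to filter the leaderboard by a latency budget and then sort what remains by humanness. The sketch below hard-codes the figures from the rankings above; the `Model` type and `within_budget` function are made up for this example, and Canopy Orpheus is excluded simply because the article lists no latency for it.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    humanness: int
    latency_ms: Optional[int]  # None where the article lists no latency


MODELS = [
    Model("xAI Grok TTS", 90, 460),
    Model("xAI Grok TTS (Streaming)", 86, 285),
    Model("Cartesia Sonic 3.5", 74, 128),
    Model("Canopy Orpheus", 74, None),
    Model("ElevenLabs Turbo v2.5", 66, 265),
    Model("ElevenLabs Eleven v3", 66, 758),
    Model("ElevenLabs Flash v2.5", 64, 197),
    Model("MiniMax Speech 2.5", 63, 325),
]


def within_budget(models: list[Model], max_latency_ms: int = 300) -> list[Model]:
    """Keep models with a known latency under the budget, most human first."""
    fast = [
        m for m in models
        if m.latency_ms is not None and m.latency_ms < max_latency_ms
    ]
    return sorted(fast, key=lambda m: m.humanness, reverse=True)


for m in within_budget(MODELS):
    print(f"{m.name}: humanness {m.humanness}, {m.latency_ms}ms")
```

Tightening `max_latency_ms` to 200 leaves only Cartesia Sonic 3.5 and ElevenLabs Flash v2.5, which shows how quickly the shortlist shrinks as the budget drops.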

How to Use This Practically

For voice agent developers:

  1. Start with the leaderboard — Before committing to a provider, check current rankings at humannessindex.vapi.ai

  2. Vote yourself — The benchmark lets you participate in blind A/B tests. After 10-20 comparisons, you’ll develop your own intuition for what “human” sounds like

  3. Consider the tradeoffs:

    • Need maximum humanness? → xAI Grok TTS
    • Need speed + quality? → Cartesia Sonic 3.5
    • Budget-conscious + open source? → Canopy Orpheus
    • Already on ElevenLabs? → Turbo v2.5 is your best option there

For model providers:

The benchmark only includes models that support voice cloning (necessary for fair comparison). If your model isn’t listed, contact humannessindex@vapi.ai.

The Open Source Angle

Everything is on GitHub: VapiAI/humanness-index

The repo includes:

  • The full Next.js site
  • Vote backend and Elo engine
  • Model registry
  • Clip generation pipeline
  • 50-trial TTFB benchmark

You can run it locally with Bun:

bun install
bun run dev  # http://localhost:3000

Vote data and standings are CC BY 4.0, so you can build on top of the rankings in your own projects.
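
If you want a feel for what a 50-trial TTFB benchmark involves before reading the repo's version, the rough shape is: start a streaming request, time how long the first audio chunk takes to arrive, repeat, and summarize. The sketch below is not the repo's code. `fake_tts_stream` is a stand-in that sleeps for 5 ms, and you would swap in your own provider's streaming call, which is assumed here to return an iterator of audio byte chunks.

```python
import statistics
import time
from typing import Callable, Iterable


def time_to_first_chunk(stream_factory: Callable[[], Iterable[bytes]]) -> float:
    """Seconds from issuing the request until the first audio chunk arrives."""
    start = time.perf_counter()
    stream = iter(stream_factory())
    next(stream)  # blocks until the first chunk is produced
    return time.perf_counter() - start


def benchmark_ttfb(stream_factory: Callable[[], Iterable[bytes]], trials: int = 50) -> dict[str, float]:
    """Run several trials and summarize time-to-first-chunk in milliseconds."""
    samples_ms = [time_to_first_chunk(stream_factory) * 1000 for _ in range(trials)]
    return {
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }


def fake_tts_stream() -> Iterable[bytes]:
    """Hypothetical stand-in for a streaming TTS call (5 ms to first chunk)."""
    time.sleep(0.005)
    yield b"\x00" * 320
    yield b"\x00" * 320


print(benchmark_ttfb(fake_tts_stream))
```

Reporting the median rather than the mean keeps one slow network round trip from skewing the result, which matters when the trial count is as small as 50.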

What This Means for Voice AI

The Humanness Index solves a real problem: vendor claims are unreliable, and technical benchmarks don’t capture “does it fool humans?”

For the first time, we have an objective (well, crowdsourced-subjective) answer to “which TTS sounds most human” — updated live as votes come in.

If you’re building voice agents, bookmark the leaderboard. If you’re evaluating TTS providers, run your own blind tests using their methodology. And if you care about advancing the field, cast some votes.
