Why Only 3% of Pilots Reach Production: 6 Bottlenecks I Heard This Week
VCs may be “keeping an ear out for Voice-AI startups” (The Information).
But the founders and the VCs I spoke with this week were too busy fighting latency and privacy fires to enjoy the hype.
Here’s what I learned, and the numbers to prove it.
Glossary (30-sec cheat-sheet)
| Term | Plain English |
|------|---------------|
| RTT | Round-trip time: caller stops talking → bot starts replying. |
| ASR / TTS | Speech-to-text / text-to-speech. |
| WER | Word-error rate: the % of words the transcript gets wrong. |
| Pipeline | The whole chain: mic ➜ ASR ➜ LLM ➜ TTS. |
(Skim this if acronyms pop up later.)
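The pipeline row above is the skeleton every later section assumes. A minimal sketch of one conversational turn, where `asr`, `llm`, and `tts` are hypothetical stubs standing in for real vendor APIs:

```python
import time

# Hypothetical stubs standing in for real ASR/LLM/TTS services;
# the stages mirror the glossary's mic ➜ ASR ➜ LLM ➜ TTS chain.
def asr(audio: bytes) -> str:          # speech-to-text
    return "what is my balance"

def llm(transcript: str) -> str:       # response generation
    return f"Answering: {transcript}"

def tts(reply: str) -> bytes:          # text-to-speech
    return reply.encode()

def handle_turn(audio: bytes) -> tuple[bytes, float]:
    """Run one turn through the chain and report RTT in milliseconds."""
    start = time.perf_counter()
    audio_out = tts(llm(asr(audio)))
    rtt_ms = (time.perf_counter() - start) * 1000
    return audio_out, rtt_ms

audio_out, rtt_ms = handle_turn(b"\x00" * 320)
print(f"RTT: {rtt_ms:.1f} ms")
```

In production each stage streams partial results instead of running sequentially, which is exactly where the latency budget fights below come from.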
TL;DR (read in 20 sec)
Weekly rebuilds. Most teams redeploy every 7-10 days to keep pace with new ASR/TTS releases.
Multilingual ≠ free. Adding Spanish or Hindi tacks +300 ms RTT onto every turn.
VC blind spot. Even VCs admitted they don’t benchmark RTT across portfolio companies.
Voiceprints are PII. Raw audio now triggers biometric-privacy reviews—encrypting transcripts alone isn’t enough.
ASR ➜ LLM ➜ TTS still rules. Direct speech-to-speech models add jitter and hamper prompt control.
Evals must evolve. Word-error rate misses “conversation feel”; founders are designing custom test sets.
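WER recurs throughout this issue, so here is a minimal sketch of how it is computed: word-level edit distance divided by reference length. Real eval harnesses normalize casing and punctuation first; this bare version skips that.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

print(wer("please check my order status", "please check order status"))  # → 0.2 (one dropped word out of five)
```

Note what the metric cannot see: a 0.2 WER transcript can still feel fine to a caller, and a 0.0 WER bot can still pause awkwardly, which is why the custom eval sets discussed below exist.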
1 | Market Pulse (big picture)
3 numbers framing the moment
$0.01/min: modern APIs make bot airtime cheaper than a postage stamp.
<800 ms RTT: anything slower feels robotic; founders now chase <300 ms to enable natural interruptions.
3% pilot-to-prod: a Coval survey of 40 enterprise POCs shows only 3% ever scale; latency, language gaps, and privacy kill most deals.
Money is flowing, but infra is still the choke-point.
2 | Field Notes (ground truth)
“Friday is VOps day.”
Whisper v3 drops Tuesday → prompts updated Wednesday → redeploy Friday.
Weekly release cadence is the new DevOps baseline; if your team ships monthly, you’re three model cycles behind by quarter-end.

Multilingual tax = +306 ms.
Retail bot: 640 → 946 ms; hang-ups +15%.
That extra third of a second sounds trivial until you watch live dashboards: callers interrupt, the LLM loses state, CSAT plummets. Run dual-stack ASR or price the churn into your CAC.

VC latency blind spot.
VCs mentioned logging CSAT and call minutes, not RTT.
I’ve stopped sending decks that lack a latency histogram; mean averages hide the tail pain where reputations die.

“Protect the voice.”
Raw audio is now AES-encrypted at rest.
Voiceprints are literally biometric IDs; expect HIPAA-like clauses in 2025 procurement, even outside healthcare.

Speech-to-speech on the bench.
~150 ms faster, but it kills prompt control and barge-in.
Until S2S supports function-calling and partial decoding, it’s a demo, not a deployment.

Custom eval sets beat plain WER.
A 500-utterance “awkward-pause” suite grades smoothness.
If a founder can’t show a bespoke eval harness, they’re stuck in sandbox mode.

Shift from $/min to $/call.
Two pilots now charge $0.75 per resolved call vs. $0.011/min.
Outcome pricing pushes infra risk back onto the builder: latency and error budgets become line items in gross margin, not “nice to fix later.”
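The per-minute vs. per-call numbers above imply a simple break-even worth running before signing an outcome-priced pilot. A sketch using the figures quoted (the average call length is an illustrative assumption):

```python
PER_MIN = 0.011    # $/min, from the pilots above
PER_CALL = 0.75    # $ per resolved call

# Break-even call length: below this, per-minute pricing is cheaper for the buyer.
break_even_min = PER_CALL / PER_MIN
print(f"break-even: {break_even_min:.0f} min")  # ≈ 68 min

# Illustrative: a 6-minute call costs pennies at per-minute rates...
avg_call_min = 6  # assumption, not from the source
print(f"per-minute cost of one call: ${PER_MIN * avg_call_min:.3f}")
# ...so outcome pricing only pencils out for the builder if resolution
# rates stay high: every unresolved call is pure infra cost.
```

That ~68× gap between the two price points is the point of the field note: the builder, not the buyer, now carries the latency and error budget.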
3 | Metrics Corner
Latency vs Cost Benchmark (US-West-1, 30-word utterance)
*Retell bundles ASR + TTS + telephony; price varies by tier.
Inference dollars are cheap, and latency is priceless. I gladly pay an extra half-cent per minute to drop from 800 ms to 650 ms; the conversion lift repays that inside a single support cycle.
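The field note about latency histograms applies directly here: report tail percentiles, not the mean. A minimal sketch with synthetic per-turn RTTs (the sample values are illustrative assumptions, shaped to have a slow tail):

```python
import statistics

# Synthetic per-turn RTTs in ms (illustrative): mostly fast, with a slow tail.
rtts = [620] * 90 + [900] * 8 + [1600] * 2

def percentile(samples, p):
    """Nearest-rank percentile over a sorted copy of the samples."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

print(f"mean = {statistics.mean(rtts):.0f} ms")  # → 662 ms: looks healthy
print(f"p95  = {percentile(rtts, 95)} ms")       # → 900 ms: callers start interrupting
print(f"p99  = {percentile(rtts, 99)} ms")       # → 1600 ms: hang-up territory
```

The mean sits comfortably under the 800 ms threshold while 1 in 10 turns blows past it, which is exactly the tail pain a histogram exposes and an average hides.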
4 | Framework - The Voice-AI Flywheel (how leaders compound data)
Capture ➜ Correct ➜ Retrain ➜ Deploy ➜ Repeat
Capture every call’s audio + partial transcripts.
Correct 3-5 % via human QA nightly.
Retrain on those fixes; clone a smaller multilingual ASR if needed.
Deploy the updated model Monday morning; latency and WER drop.
Repeat - your proprietary data moat grows daily.
Teams that label just 5 % of calls nightly cut WER 40 % in six weeks on a single GPU.
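The five steps above can be sketched as a weekly loop. Every function here is a hypothetical stub standing in for a team's real QA tooling and training jobs; only the shape of the cycle comes from the source:

```python
def capture_calls():
    """Step 1: collect each call's audio + partial transcripts (stubbed)."""
    return [{"audio": b"...", "transcript": "helo wrld"}]

def human_correct(calls, fraction=0.05):
    """Step 2: route ~3-5% of calls to nightly human QA (stubbed fix)."""
    n = max(1, int(len(calls) * fraction))
    return [{**c, "gold": c["transcript"].replace("helo", "hello")} for c in calls[:n]]

def retrain(model, labeled):
    """Step 3: fine-tune on the corrected transcripts (stubbed)."""
    return {"version": model["version"] + 1, "train_size": len(labeled)}

def deploy(model):
    """Step 4: ship the updated model Monday morning."""
    print(f"deployed v{model['version']}")

model = {"version": 1, "train_size": 0}
for week in range(2):  # Step 5: repeat; the labeled-data moat compounds weekly
    labeled = human_correct(capture_calls())
    model = retrain(model, labeled)
    deploy(model)
```

The compounding is the point: each pass through the loop adds corrected, in-domain labels that no competitor can buy off the shelf.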