Feb 16, 2025
How AI Voice Agents Work: Technology Explained
Understand how AI voice agents work. Learn the 5-stage pipeline, latency challenges, streaming architecture, and why scaling to 10,000 concurrent calls matters.
Author: Gracia Perkin

A customer calls. Within about half a second, the system transcribes their speech, understands what they want, retrieves data, and responds naturally. No human involved. How does this happen?
Most people think voice agents are chatbots that talk. Wrong. They're fundamentally different systems built on a cascading architecture in which each component specializes in one stage of the conversation.
The Five-Stage Pipeline: How Each Component Specializes
AI voice agents use five sequential systems, not one. Each converts its input into output for the next stage. Think assembly line: audio in, intelligent response out.
Most production voice agents use cascading pipelines because they're modular, testable, and flexible. You optimize each stage independently.
Stage 1: Speech Input. The system listens, captures audio, filters background noise, and detects when the user speaks.
Stage 2: Speech Recognition (ASR). Automatic Speech Recognition converts audio to text. Modern systems achieve a word error rate of about 4.9%, but errors cascade: mishearing "Thursday" as "3:00" breaks intent parsing downstream. Streaming ASR addresses latency rather than accuracy: processing audio chunks in real time cuts recognition latency from about 500ms to 100-200ms.
Stage 3: Understanding (NLU & LLM). Text enters Natural Language Understanding. Traditional NLU is rule-based. Modern systems use Large Language Models that understand context and reason through multi-step requests. Dialogue management tracks conversation state and decides whether to proceed or ask clarifying questions. Tool use lets the system execute API calls such as schedule_meeting(customer_id=12345, time="2pm") rather than just generate text.
Stage 4: Response Generation. The system decides what to say and structures it for natural delivery: sentence breaks, pauses, emphasis. It fetches data from APIs or databases if needed.
Stage 5: Speech Synthesis (TTS). Text-to-Speech converts the response to audio. Modern neural TTS captures prosody and emotion. But speed is critical, because slow TTS kills the conversation: users expect a response within 500ms of the end of their speech.
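The five stages can be sketched as five functions chained together. This is a minimal illustration of the cascading idea only: every function body below is a hypothetical stand-in (the "ASR" simply decodes bytes as text, the "TTS" simply encodes text as bytes), not a real speech model.

```python
def capture_audio(raw: bytes) -> bytes:
    """Stage 1: stand-in for noise filtering and speech detection (passes audio through)."""
    return raw

def recognize(audio: bytes) -> str:
    """Stage 2: stand-in ASR that treats the bytes as UTF-8 text."""
    return audio.decode("utf-8")

def understand(text: str) -> dict:
    """Stage 3: stand-in NLU/LLM that extracts a toy intent by keyword."""
    intent = "schedule_meeting" if "meeting" in text.lower() else "unknown"
    return {"intent": intent, "text": text}

def generate_response(meaning: dict) -> str:
    """Stage 4: decide what to say based on the intent."""
    if meaning["intent"] == "schedule_meeting":
        return "Sure. What time works for you?"
    return "Sorry, could you rephrase that?"

def synthesize(reply: str) -> bytes:
    """Stage 5: stand-in TTS that encodes the reply as bytes."""
    return reply.encode("utf-8")

# The assembly line: each stage's output is the next stage's input.
STAGES = [capture_audio, recognize, understand, generate_response, synthesize]

def run_pipeline(raw_audio: bytes) -> bytes:
    value = raw_audio
    for stage in STAGES:
        value = stage(value)
    return value

audio_out = run_pipeline(b"Book a meeting")
```

Because each stage is an ordinary function with its own input and output type, any one of them can be tested or swapped without touching the others, which is the modularity argument made above.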
Three Core Technologies
Speech recognition, understanding, and synthesis enable machines to listen, interpret, and respond to human language naturally. Together, these technologies power virtual assistants, chatbots, and smart communication systems.
Speech Recognition (ASR)
Automatic Speech Recognition (ASR) analyzes sound waves to identify words. Modern systems achieve a word error rate of about 4.9%. But errors cascade: a misheard "Thursday" becomes "3:00," breaking downstream logic.
Acoustic drift is the real problem. Models trained on clean American English degrade on Spanish-accented English or in noisy environments. One company measured 95% accuracy in the lab but only 72% in production with actual customers.
Streaming ASR processes audio chunks in real time instead of waiting for the complete utterance. This reduces latency from about 500ms to 100-200ms.
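The chunked processing described above can be sketched as two generators. The chunking arithmetic is standard; the recognizer itself is a hypothetical stand-in that decodes bytes as ASCII text, used only to show partial transcripts growing as chunks arrive. The 16 kHz, 16-bit, 100ms figures are illustrative assumptions, not values from any particular system.

```python
from typing import Iterable, Iterator

def chunk_size_bytes(sample_rate_hz: int, sample_width_bytes: int, chunk_ms: int) -> int:
    """Bytes of mono PCM audio contained in one chunk of the given duration."""
    return sample_rate_hz * sample_width_bytes * chunk_ms // 1000

def chunk_audio(samples: bytes, chunk_bytes: int) -> Iterator[bytes]:
    """Yield the audio in fixed-size chunks instead of one complete buffer."""
    for start in range(0, len(samples), chunk_bytes):
        yield samples[start:start + chunk_bytes]

def streaming_transcribe(chunks: Iterable[bytes]) -> Iterator[str]:
    """Hypothetical recognizer: emits a growing partial transcript after every chunk."""
    partial = ""
    for chunk in chunks:
        partial += chunk.decode("ascii")  # stand-in for real acoustic decoding
        yield partial

# 100ms of 16 kHz, 16-bit mono audio
bytes_per_chunk = chunk_size_bytes(16000, 2, 100)

# Fake "audio" whose bytes are text, so the partial transcripts are readable
partials = list(streaming_transcribe(chunk_audio(b"book a table", 5)))
```

Downstream stages can start working on the first partial transcript rather than waiting for the final one, which is where the latency saving comes from.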
Natural Language Understanding (NLU)
Text enters NLU to extract intent and entities. Traditional NLU uses keyword matching. Modern systems use Large Language Models that understand context and multi-step requests.
LLMs handle ambiguity. When a user says "That was expensive," traditional NLU can't determine what "that" refers to. LLMs resolve it from the conversation history.
Tool use (function calling) makes execution possible. The LLM doesn't just generate text. It outputs structured commands: schedule_meeting(customer_id=12345, time="2pm"). This enables multi-step tasks that plain chatbots cannot perform.
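One way to picture function calling: the model emits a structured command, and application code looks the name up in a registry and runs it. The sketch below assumes the command arrives as a JSON object with "name" and "arguments" keys; that wire format, the TOOLS registry, and the body of schedule_meeting are all illustrative assumptions rather than any specific vendor's API.

```python
import json
from typing import Any, Callable, Dict

def schedule_meeting(customer_id: int, time: str) -> str:
    """Hypothetical backend action; a real one would call a calendar API."""
    return f"Meeting booked for customer {customer_id} at {time}"

# Registry of functions the model is allowed to invoke
TOOLS: Dict[str, Callable[..., Any]] = {"schedule_meeting": schedule_meeting}

def dispatch(llm_output: str) -> Any:
    """Parse the model's structured command and execute the matching tool."""
    call = json.loads(llm_output)
    name = call["name"]
    if name not in TOOLS:
        # Never execute something the registry does not explicitly allow
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

result = dispatch(
    '{"name": "schedule_meeting", "arguments": {"customer_id": 12345, "time": "2pm"}}'
)
```

Keeping an explicit allow-list means a malformed or unexpected model output fails loudly instead of triggering an arbitrary action.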
Speech Synthesis (TTS)
TTS converts text to audio. Modern neural TTS captures prosody—rhythm, emphasis, emotional tone. Old TTS sounded robotic. New TTS sounds human.
Speed is critical. Slow TTS breaks conversation flow. Streaming TTS plays the first sentence while generating the second, reducing perceived latency.
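Sentence-level streaming can be sketched as a generator that watches the LLM's token stream and releases each sentence to the synthesizer as soon as it is complete. This is a simplified illustration: the punctuation-based boundary check is an assumption and would split too early on abbreviations such as "Dr.".

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")

def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left if the stream ends without punctuation
    if buffer.strip():
        yield buffer.strip()

tokens = ["Your ", "order ", "shipped.", " It ", "arrives ", "Friday."]
sentences = list(sentences_from_tokens(tokens))
```

With this shape, the first sentence can be handed to TTS and played while the model is still producing the tokens of the second.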
Latency: The Silent Killer
Why 500 Milliseconds Matter
Humans expect responses in 300-500ms. Delays beyond 500ms feel awkward. Beyond 1.2 seconds, users hang up.
Real production systems take 800ms-2 seconds. Too slow. Latency compounds across the pipeline.
Where Delays Hide: Audio buffering (20-100ms) → ASR (100-200ms with streaming) → LLM thinking (200-500ms, largest component) → TTS generation (50-200ms) → Network calls (50-150ms).
Total: 800-1500ms in most systems.
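The budget above can be tallied directly. The sketch below uses only the per-stage ranges listed in this section; the dictionary keys are illustrative names. Note that the itemized ranges by themselves sum to 420-1150ms, so the 800-1500ms total quoted for most systems presumably includes overhead that is not itemized here.

```python
from typing import Dict, Tuple

# (best case, worst case) in milliseconds, taken from the ranges listed above
BUDGET_MS: Dict[str, Tuple[int, int]] = {
    "audio_buffering": (20, 100),
    "asr_streaming": (100, 200),
    "llm_inference": (200, 500),
    "tts_generation": (50, 200),
    "network": (50, 150),
}

def total_range(budget: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the best-case and worst-case latency when stages run one after another."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high

def largest_stage(budget: Dict[str, Tuple[int, int]]) -> str:
    """Name of the stage with the highest worst-case latency."""
    return max(budget, key=lambda name: budget[name][1])

low_ms, high_ms = total_range(BUDGET_MS)   # (420, 1150)
bottleneck = largest_stage(BUDGET_MS)      # "llm_inference"
```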
How to Fix It: A streaming architecture runs the stages in parallel. ASR starts while the user is still speaking. NLU begins working on partial transcripts. TTS starts generating while the LLM is still reasoning. Sentence-level streaming plays the first sentence's audio while the second sentence is being generated.
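A rough sketch of that overlap using Python's asyncio: three coroutines run concurrently and pass work through queues, so each stage starts as soon as the previous one emits its first item. The stage bodies are toy stand-ins (uppercase text as a "transcript", a prefixed string as a "reply"), and the use of None as an end-of-stream marker is an assumption of this sketch.

```python
import asyncio
from typing import List

async def asr(audio_chunks: List[str], out: asyncio.Queue) -> None:
    for chunk in audio_chunks:
        await asyncio.sleep(0)            # yield control, as real I/O would
        await out.put(chunk.upper())      # stand-in for a transcript segment
    await out.put(None)                   # end-of-stream marker

async def llm(inp: asyncio.Queue, out: asyncio.Queue) -> None:
    while (text := await inp.get()) is not None:
        await out.put(f"reply to {text}")  # stand-in for generated text
    await out.put(None)

async def tts(inp: asyncio.Queue, played: List[str]) -> None:
    while (sentence := await inp.get()) is not None:
        played.append(f"<audio:{sentence}>")  # stand-in for synthesized audio

async def run(audio_chunks: List[str]) -> List[str]:
    transcripts: asyncio.Queue = asyncio.Queue()
    replies: asyncio.Queue = asyncio.Queue()
    played: List[str] = []
    # All three stages run concurrently instead of one after another
    await asyncio.gather(
        asr(audio_chunks, transcripts),
        llm(transcripts, replies),
        tts(replies, played),
    )
    return played

played = asyncio.run(run(["hello", "book a table"]))
```

The point of the structure is that no stage waits for the previous stage to finish its whole input, only for its first output.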
Quantized LLMs can run 3-5x faster than full-precision models with minimal accuracy loss. The latency gain usually justifies the trade-off.
Real impact: a 1500ms baseline feels broken. The same components in an optimized architecture, at 400ms, feel natural.
Pipeline vs. End-to-End Architecture
Why Most Systems Use Pipelines
Cascading pipelines (audio → ASR → NLU → NLG → TTS → audio) dominate production. Each component is independent, testable, replaceable.
Speech-to-speech (S2S) models take audio in directly and generate audio out, with no intermediate text conversion. Examples: Qwen-Omni, Moshi. They offer lower latency and better emotion preservation, but the technology is still immature.
Pipeline advantages: modularity, flexibility, maturity. Pipeline disadvantages: latency compounds, information loss in conversion.
Current reality: an estimated 95% of production systems use pipelines. S2S is the future, but likely 12-24 months away from maturity.
Modern enterprise platforms such as Zelu AI still rely heavily on optimized pipeline architectures because they offer better scalability, observability, and production reliability.
Metrics That Actually Matter
WER (word error rate) seems important but is a poor proxy for success. A 5% WER doesn't mean 5% task failure, because errors cascade: one misrecognized word in a key slot can fail the entire task.
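To make the gap between WER and task failure concrete, here is a small sketch that computes WER as word-level edit distance divided by the number of reference words. The example sentence is invented for illustration: a single substitution in a ten-word utterance gives a 10% WER, yet the one wrong word is the appointment day, so the booking itself is wrong.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

reference = "please book my appointment for Thursday at the downtown office"
hypothesis = "please book my appointment for 3:00 at the downtown office"
wer = word_error_rate(reference, hypothesis)  # 1 error / 10 words
```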
Real metrics:
Task completion rate (did agent complete the task without escalation?)
Latency: first response time should be 300-700ms
Slot filling accuracy (did system capture the right information?)
CSAT: customer satisfaction driven by latency + accuracy + naturalness
Industry baseline for task completion: 45-65%. Top performers: 85-95%.
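These metrics are straightforward to compute from call logs. The sketch below assumes a hypothetical CallRecord shape and made-up sample data; real logging schemas will differ, and CSAT is omitted because it comes from surveys rather than from the call trace.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CallRecord:
    completed: bool                 # did the agent finish the task?
    escalated: bool                 # was a human brought in?
    first_response_ms: float        # speech end to first audio out
    expected_slots: Dict[str, str]  # ground truth, e.g. from a labeled sample
    captured_slots: Dict[str, str]  # what the system actually extracted

def task_completion_rate(calls: List[CallRecord]) -> float:
    done = sum(1 for c in calls if c.completed and not c.escalated)
    return done / len(calls)

def slot_filling_accuracy(calls: List[CallRecord]) -> float:
    total = sum(len(c.expected_slots) for c in calls)
    correct = sum(
        1
        for c in calls
        for key, value in c.expected_slots.items()
        if c.captured_slots.get(key) == value
    )
    return correct / total

def share_within_latency(calls: List[CallRecord], max_ms: float = 700) -> float:
    return sum(1 for c in calls if c.first_response_ms <= max_ms) / len(calls)

calls = [
    CallRecord(True, False, 450, {"date": "Thursday"}, {"date": "Thursday"}),
    CallRecord(True, False, 650, {"date": "Friday"}, {"date": "Friday"}),
    CallRecord(False, True, 1300, {"date": "Monday"}, {"date": "3:00"}),
    CallRecord(True, True, 900, {"date": "Tuesday"}, {"date": "Tuesday"}),
]
completion = task_completion_rate(calls)
slot_accuracy = slot_filling_accuracy(calls)
fast_share = share_within_latency(calls)
```

In this sample, a call that completed only after escalation does not count as a completed task, matching the definition given in the list above.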
Scaling to Thousands of Calls
Scaling from 10 to 10,000 concurrent calls requires an architectural shift to stateless services. Stateless means each ASR/NLU/TTS instance is identical.
No instance carries state. User context lives in a separate database. If one instance fails, another picks up the call immediately.
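A minimal sketch of the stateless pattern: workers keep nothing between turns and read or write conversation context through a shared store. The SessionStore here is an in-memory stand-in for the separate database, and both class names are hypothetical.

```python
from typing import Dict, List

class SessionStore:
    """In-memory stand-in for the external database that holds user context."""
    def __init__(self) -> None:
        self._data: Dict[str, List[str]] = {}

    def load(self, call_id: str) -> List[str]:
        return list(self._data.get(call_id, []))

    def save(self, call_id: str, history: List[str]) -> None:
        self._data[call_id] = list(history)

class Worker:
    """Holds no per-call state; everything it needs comes from the store."""
    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def handle_turn(self, call_id: str, utterance: str) -> str:
        history = self.store.load(call_id)
        history.append(utterance)
        self.store.save(call_id, history)
        return f"{self.name} handled turn {len(history)}"

store = SessionStore()
worker_a, worker_b = Worker("a", store), Worker("b", store)
first = worker_a.handle_turn("call-1", "hi")
# Worker "a" could fail here; worker "b" continues the same call seamlessly.
second = worker_b.handle_turn("call-1", "book a table")
```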
Load balancing distributes calls across instances. Health checks remove unhealthy instances. Geographic distribution routes users to nearest datacenter.
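As an illustration of load balancing with health checks, here is a small round-robin router that skips instances marked unhealthy. It is a sketch only: the instance names are made up, geographic routing is left out, and in production this job is normally done by a dedicated load balancer rather than application code.

```python
from typing import List, Set

class LoadBalancer:
    def __init__(self, instances: List[str]) -> None:
        self.instances = list(instances)
        self.healthy: Set[str] = set(instances)
        self._next = 0

    def mark_unhealthy(self, name: str) -> None:
        """Called when a health check fails."""
        self.healthy.discard(name)

    def mark_healthy(self, name: str) -> None:
        """Called when a previously failing instance recovers."""
        if name in self.instances:
            self.healthy.add(name)

    def route(self) -> str:
        """Return the next healthy instance in round-robin order."""
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next % len(self.instances)]
            self._next += 1
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances available")

lb = LoadBalancer(["asr-1", "asr-2", "asr-3"])
before = [lb.route() for _ in range(3)]
lb.mark_unhealthy("asr-2")
after = [lb.route() for _ in range(4)]
```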
Infrastructure that enables this scaling: for example, ten instances per component at peak load, with the ability to spin up new instances in seconds if concurrent calls exceed capacity.
Final Thoughts
Voice agents aren't magic. They're five systems orchestrated under tight latency constraints. The difficulty isn't the technology; it's doing it naturally in under 500ms, handling real-world acoustic conditions, maintaining context across turns, and scaling to thousands of concurrent users.
Understanding this architecture tells you where optimization matters (usually LLM inference, not ASR), when to build custom versus use platforms, and why latency compounds.
FAQs
What's the typical latency we should expect?
Natural conversation requires <500ms response time from speech end to first audio output. Most production systems hit 800ms-1.5 seconds. Optimized systems can achieve 400-600ms with streaming architecture and quantized LLMs. Anything beyond 1.2 seconds causes users to hang up or interrupt.
Why does acoustic drift happen and how do we prevent it?
Models trained on clean, standard American English fail on Spanish-accented English, noisy call centers, or technical jargon. Real example: 95% accuracy in lab, 72% in production. Prevention: test with actual customer audio under real conditions, use accent-robust models trained on multilingual data, and deploy noise-cancellation for realistic acoustic environments.
Should we wait for speech-to-speech models or build with pipelines now?
Build with pipelines now. S2S models are 12-24 months from maturity. Pipelines give you flexibility, modularity, and proven tooling. You can migrate to S2S later if latency demands require it. Waiting means delaying deployment and losing competitive advantage.
How do we monitor if latency is actually our bottleneck?
Instrument each component: ASR time, NLU time, LLM inference time, TTS time, network time. If the total exceeds 700ms, users perceive slowness. If it exceeds 1.2 seconds, they hang up. LLM inference is usually the slowest stage (200-500ms). Optimize there first with quantization or faster models. Fast ASR and TTS don't matter if the LLM is slow.
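One lightweight way to instrument each component is a timing context manager wrapped around every stage. The sketch below uses only the Python standard library; the LatencyTracker class, the stage names, and the sleep calls that simulate work are illustrative assumptions. The thresholds in perception() are the 700ms and 1.2 second figures from this answer.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

class LatencyTracker:
    def __init__(self) -> None:
        self.timings_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        """Time the enclosed block and add it to the named stage's total."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.timings_ms[name] = self.timings_ms.get(name, 0.0) + elapsed_ms

    def total_ms(self) -> float:
        return sum(self.timings_ms.values())

    def slowest(self) -> str:
        return max(self.timings_ms, key=lambda name: self.timings_ms[name])

def perception(total_ms: float) -> str:
    """Map total latency to the user-experience bands described above."""
    if total_ms <= 700:
        return "ok"
    if total_ms <= 1200:
        return "slow"
    return "hangup_risk"

tracker = LatencyTracker()
with tracker.stage("asr"):
    time.sleep(0.01)    # simulated work
with tracker.stage("llm"):
    time.sleep(0.03)    # simulated work
with tracker.stage("tts"):
    time.sleep(0.005)   # simulated work
```

Calling tracker.slowest() on real traffic shows which stage to optimize first, and perception(tracker.total_ms()) gives a quick read on whether a call is inside the conversational budget.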


