How to Build a Voice Agent in 2026: Vapi, Retell, Latency, and Deployment

May 2, 2026 · Editorial Team · 9 min read · voice-agents ai-agents tutorial

Voice agents are not chatbots with audio attached. The moment a real phone call is involved, the constraints change entirely. Latency that would be acceptable in a text interface (two seconds, say) becomes a conversation killer on a phone call. Users hang up, and the experience collapses. Building a voice agent that actually works in production means treating latency as a first-class constraint from day one, before you write a line of application code.

This guide covers how to build a voice AI agent in 2026: choosing between Vapi and Retell, picking an LLM that fits the latency budget, wiring up telephony, and getting to a working deployment.

What you're actually building

A voice agent has four components in the signal chain:

Speech-to-text (STT): Converts the caller's audio into text.
Language model: Reads the transcript and produces a text response.
Text-to-speech (TTS): Converts the response back into audio.
Telephony layer: Handles the actual phone call, including SIP, PSTN, and call routing.

The total latency budget, measured from the moment the caller stops speaking to the moment your agent starts responding, is ideally under 800 milliseconds. In practice, most callers still perceive a pause of 1.0 to 1.3 seconds as natural. Beyond 1.5 seconds, the pause starts to feel wrong. Beyond 2 seconds, callers think the call dropped.

That budget has to cover all four steps. This is why LLM choice, STT model choice, and infrastructure region all matter in ways they don't for text-based agents.
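To make the budget concrete, here is a minimal sketch that sums per-stage latencies and maps the total onto the thresholds above. The stage names and millisecond values are illustrative assumptions, not measurements of any provider, and the "borderline" band between 1.3 and 1.5 seconds is an interpolation for the sketch.

```python
# Illustrative per-stage latencies in milliseconds. These numbers are
# assumptions for the sketch, not measurements of any provider.
STAGE_LATENCY_MS = {
    "stt_finalization": 150,
    "llm_first_token": 300,
    "tts_first_audio": 150,
    "telephony_transport": 100,
}


def classify_turn_latency(total_ms: float) -> str:
    """Map a total response latency onto the thresholds described above."""
    if total_ms < 800:
        return "ideal"
    if total_ms <= 1300:
        return "natural"
    if total_ms <= 1500:
        return "borderline"  # band not named in the article; assumed here
    if total_ms <= 2000:
        return "feels wrong"
    return "caller assumes the call dropped"


total_ms = sum(STAGE_LATENCY_MS.values())
print(f"{total_ms} ms -> {classify_turn_latency(total_ms)}")
```

Swapping in your own measured numbers per stage shows quickly which component is eating the budget.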

Vapi vs Retell: which platform to start with

Both Vapi and Retell are voice AI platforms that handle the telephony and audio pipeline for you, so you can focus on the agent logic. They're the two most production-ready options in 2026. Here's where they actually differ.

Vapi has a larger ecosystem of integrations and a more flexible architecture. You can plug in your own LLM endpoint, your own STT provider, and your own TTS voice. The platform is designed around the assumption that you'll want to customize components. Vapi also has better tooling for phone number management, call routing, and multi-step workflows. For teams building complex outbound campaigns or multi-agent call flows, Vapi's flexibility is worth it.

The downside of Vapi is that flexibility means more configuration. Getting a minimal agent running takes more setup than Retell, and the documentation, while improving, has some gaps.

Retell trades flexibility for speed of implementation. Out of the box, Retell gives you solid default STT and TTS configurations, a clean API, and a simpler mental model. You can have a working agent in an afternoon. Retell is tighter about which LLMs you can use and how you customize the audio pipeline, but for most use cases those defaults are fine.

If you're building a first voice agent or want to validate the concept quickly, start with Retell. If you need full control over every component or are building at scale with complex requirements, use Vapi.

Pricing as of May 2026: Vapi charges approximately $0.05 per minute of call time on top of the underlying LLM and TTS costs. Retell is similar at $0.04-0.05 per minute. The difference is small enough that it shouldn't drive your platform choice.

Setting up a basic Retell agent

Here's the minimum to get a Retell voice agent accepting calls:

from retell import Retell

client = Retell(api_key="your_retell_api_key")

# Create an LLM configuration
llm = client.llm.create(
    model="gpt-4o-mini",  # fast, low-latency model
    general_prompt="""You are a helpful assistant for Acme Corp.
    You answer questions about our products and services.
    Keep responses short, this is a phone conversation.""",
    general_tools=[],
)

# Create the agent
agent = client.agent.create(
    agent_name="Acme Support",
    voice_id="11labs-Adrian",
    language="en-US",
    # Point the agent at the LLM configuration created above
    response_engine={
        "type": "retell-llm",
        "llm_id": llm.llm_id,
    },
    # Must be one of Retell's preset ambient sounds; confirm the value in the docs
    ambient_sound="call-center",
    enable_backchannel=True,
)

print(f"Agent ID: {agent.agent_id}")

The enable_backchannel setting lets the agent interject short acknowledgements ("mm-hmm", "I see") while the caller is speaking, which signals that it is listening and makes the exchange feel less robotic. It does not hide LLM inference time, so you still need a fast model, but it noticeably improves how natural the call feels.

LLM choice for voice: latency over capability

The most capable model is not the right model for a voice agent. GPT-4o and Claude Opus 4 produce better text than GPT-4o mini, but their time-to-first-token is often 600-900ms, which alone consumes most of your latency budget before TTS even starts.

The practical choices for production voice agents in 2026:

GPT-4o mini: The most common choice for voice. Time-to-first-token around 200-400ms. Capable enough for most customer-facing use cases. This is the default for most Retell deployments for a reason.

Groq-hosted Llama 3.3 70B: Groq's inference hardware is optimized for speed. You get 150-300ms time-to-first-token for a model that's meaningfully more capable than GPT-4o mini. The tradeoff is occasional availability issues during peak load. Worth testing if you need more reasoning capability.

Claude 3.5 Haiku: Anthropic's fastest hosted model. Performs well on instruction-following and handles nuanced conversation better than GPT-4o mini at comparable latency. A good option if your use case requires the agent to handle complex, multi-turn conversations.

Gemini 2.0 Flash: Google's low-latency offering, with strong multilingual performance. If you're building for non-English markets, Flash is worth serious consideration.

The key rule: test your LLM latency from the same region as your telephony infrastructure. Latency numbers from benchmarks may not reflect your actual deployment.
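One way to run that test is to time the first streamed chunk yourself, from a machine in the target region. The sketch below is provider-agnostic: `time_to_first_token_ms` takes any callable that starts a streaming request and returns an iterable of text chunks. `fake_stream` is a hypothetical stand-in so the example runs without credentials; replace it with a wrapper around your provider's streaming call.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_token_ms(start_stream: Callable[[], Iterable[str]]) -> float:
    """Return milliseconds from issuing the request to the first non-empty chunk."""
    started = time.perf_counter()
    for chunk in start_stream():
        if chunk:
            return (time.perf_counter() - started) * 1000.0
    raise RuntimeError("stream ended without producing any text")


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM call; the sleep imitates
    network plus inference delay."""
    time.sleep(0.05)
    yield "Hello"
    yield " there"


# Take several samples and report the median, since single runs are noisy.
samples = sorted(time_to_first_token_ms(fake_stream) for _ in range(5))
median_ms = samples[len(samples) // 2]
print(f"median TTFT: {median_ms:.0f} ms")
```

Run it at different times of day as well. Hosted inference latency can shift under load, and a single sample will not show that.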

Prompting for voice: shorter is better

A voice agent prompt needs a different approach than a chat prompt. On a phone call, the user can't scroll up. They can't re-read your agent's response. They're listening in real time, often in a noisy environment.

Rules that actually matter in practice:

Keep responses to 1-3 sentences. If your agent is producing 200-word responses, the prompt is wrong. Add an explicit instruction: "Keep all responses under 40 words. Use conversational language."

Avoid lists and structure. Bullet points and numbered lists work in text. Read aloud, they sound like someone reciting a manual. Tell the agent to convert structured information into flowing speech.

Handle interruptions explicitly. Users will talk over the agent. Most platforms detect this and cut the audio stream. Your prompt needs to account for the agent resuming gracefully after an interruption rather than finishing its previous response.

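State management matters more than in text. The agent needs to track what it has already said. A caller asking the same question twice should get the same answer, not "As I mentioned..." followed by a different version. Structure your system prompt so that the facts the agent relies on (what has been confirmed, what has been offered) are stated explicitly rather than left for the model to infer from the transcript.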
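The rules above can be sketched as a small prompt builder plus a check you can run over transcripts. Everything here is illustrative: the rule wording, the function names, and the 40-word limit (taken from the example instruction above) are assumptions to adapt, not a platform API.

```python
import re

VOICE_RULES = (
    "Keep all responses under 40 words. Use conversational language. "
    "Never use bullet points or numbered lists; describe options in flowing sentences. "
    "If the caller interrupts, drop the rest of your previous reply and answer what they just said."
)


def build_voice_prompt(role: str, facts: dict[str, str]) -> str:
    """Combine a role description, the voice rules, and explicit call state."""
    state = "; ".join(f"{key}: {value}" for key, value in facts.items())
    return f"{role}\n\n{VOICE_RULES}\n\nKnown facts for this call: {state}"


def speakable(items: list[str]) -> str:
    """Turn a list of options into a phrase that reads naturally aloud."""
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return f"{items[0]} or {items[1]}"
    return f"{', '.join(items[:-1])}, or {items[-1]}"


def violates_voice_rules(reply: str, max_words: int = 40) -> bool:
    """Flag replies that are too long or that contain list markup."""
    has_list_markup = re.search(
        r"^\s*(?:[-*•]|\d+[.)])\s+", reply, flags=re.MULTILINE
    ) is not None
    return has_list_markup or len(reply.split()) > max_words


print(build_voice_prompt("You are a scheduling assistant for Acme Corp.",
                         {"open slots": speakable(["Monday", "Tuesday", "Friday"])}))
```

Running `violates_voice_rules` over the agent turns in real transcripts is a cheap way to find prompts that are drifting back toward chat-style output.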

Telephony: connecting to real phone numbers

You have three main options for getting your voice agent onto actual phone calls:

Vapi or Retell native phone numbers: Both platforms let you purchase phone numbers directly through their API. This is the fastest path. Numbers are US and Canada by default, with international options available. For outbound calling, you specify the number as the caller ID when initiating a call.

Twilio SIP trunk: If you already use Twilio or need more control over call routing, you can connect Twilio to either platform via SIP. This gives you access to Twilio's number inventory and programmable call routing. The setup takes more work but is worth it if you need features like time-based routing, IVR fallback, or call recording in a specific S3 bucket.

Bring your own SIP: For enterprises with existing PBX infrastructure, both Vapi and Retell support SIP integration. Your voice agent becomes one endpoint in an existing call routing setup.

For most teams building from scratch, start with the native phone numbers. You can always migrate to a more complex telephony setup later.

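# Initiate an outbound call with Retell
# (reuses `client` and `agent` from the setup example above)
call = client.call.create_phone_call(
    from_number="+14155551234",   # your Retell number
    to_number="+14085559876",     # the person you're calling
    # The agent bound to from_number is used by default;
    # override_agent_id selects a different agent for this call
    override_agent_id=agent.agent_id,
    retell_llm_dynamic_variables={
        "customer_name": "Sarah",
        "account_status": "overdue",
    },
)
print(f"Call ID: {call.call_id}")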

Dynamic variables let you pass context from your application into the agent prompt at call time. This is how you personalize calls without building a separate agent per use case.
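To see what the agent will actually read, it helps to render the prompt locally before dialing. The sketch below assumes the prompt references variables with double-brace placeholders such as {{customer_name}}; confirm the exact syntax in your platform's documentation. `render_prompt` is a hypothetical local preview helper, not part of either SDK.

```python
import re

# Assumed placeholder syntax: {{variable_name}}
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Substitute {{name}} placeholders, failing loudly on missing values."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value supplied for placeholder '{name}'")
        return variables[name]

    return PLACEHOLDER.sub(replace, template)


template = "You are calling {{customer_name}}. Their account is {{account_status}}."
preview = render_prompt(
    template, {"customer_name": "Sarah", "account_status": "overdue"}
)
print(preview)
```

Failing on a missing variable is deliberate: an unfilled placeholder read aloud on a live call is worse than a call that never starts.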

Handling tool calls in voice agents

Most useful voice agents need to do something beyond conversation: look up an order status, book an appointment, transfer the call to a human. This is where tool calling in voice agents gets interesting.

Tool execution adds latency. If your agent calls a database lookup tool that takes 500ms, that 500ms gets added to the response latency for that turn. The user hears silence.

Two approaches work:

Parallel tool execution: For tools that don't depend on each other, execute them simultaneously. If the agent needs to check inventory and check shipping status, run both queries at once.

Speculative pre-fetching: If you can predict what information the agent will likely need (order status for a call about an order), fetch it before the call starts and inject it into the system prompt as context. This eliminates the in-call latency entirely for the most common lookups.
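Both approaches can be sketched with asyncio. The two lookup functions are hypothetical, and their sleeps stand in for real database or API calls. `run_tools_in_parallel` shows that two independent 200ms lookups cost roughly 200ms when gathered rather than 400ms, and `prefetch_context` shows the pre-call fetch whose result you would pass in as prompt context.

```python
import asyncio
import time


async def check_inventory(sku: str) -> dict:
    """Hypothetical lookup; the sleep stands in for a database query."""
    await asyncio.sleep(0.2)
    return {"sku": sku, "in_stock": True}


async def check_shipping(order_id: str) -> dict:
    """Hypothetical lookup; the sleep stands in for a carrier API call."""
    await asyncio.sleep(0.2)
    return {"order_id": order_id, "status": "in transit"}


async def run_tools_in_parallel(sku: str, order_id: str) -> tuple[dict, dict, float]:
    """Run two independent lookups concurrently and report elapsed seconds."""
    started = time.perf_counter()
    inventory, shipping = await asyncio.gather(
        check_inventory(sku), check_shipping(order_id)
    )
    return inventory, shipping, time.perf_counter() - started


async def prefetch_context(order_id: str) -> dict[str, str]:
    """Fetch likely-needed data before the call so no in-call lookup is needed."""
    shipping = await check_shipping(order_id)
    return {"order_status": shipping["status"]}


inventory, shipping, elapsed = asyncio.run(run_tools_in_parallel("SKU-1", "ORD-42"))
print(f"both lookups finished in {elapsed:.2f}s")
print(asyncio.run(prefetch_context("ORD-42")))
```

Parallel execution only helps when the tools are truly independent. If the second lookup needs the first one's result, pre-fetching is the better lever.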

For transfers to human agents, both Vapi and Retell support call transfer via SIP. Configure a fallback transfer number in your agent settings and give the LLM a tool that triggers the transfer when appropriate.

Measuring what matters

Before you ship a voice agent to real users, track these metrics:

End-to-end latency (p50 and p95): Measure from end-of-user-speech to start-of-agent-speech. The median should be under 1.0 second. The 95th percentile matters more, because a slow outlier on every 20th turn degrades trust.

Call completion rate: What percentage of calls reach a natural conclusion versus the user hanging up mid-conversation. Low completion rates almost always point to latency or response quality issues.

Escalation rate: How often does the agent transfer to a human? Some escalation is expected and correct. A high rate means the agent is failing to handle cases it should handle.

Sentiment during call: Both Retell and Vapi provide a basic sentiment classification per call. Use this to surface calls where the user became frustrated, and listen to those recordings.
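The first three metrics are simple to compute from your own call logs. The sketch below assumes a hypothetical `CallRecord` shape that you would populate from platform webhooks or exports; the field names are illustrative. It uses a nearest-rank percentile, which is one of several common definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class CallRecord:
    turn_latencies_ms: list[float]  # one entry per agent turn
    completed: bool                 # reached a natural conclusion
    escalated: bool                 # transferred to a human


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(calls: list[CallRecord]) -> dict[str, float]:
    """Aggregate latency percentiles and outcome rates across calls."""
    latencies = [ms for call in calls for ms in call.turn_latencies_ms]
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "completion_rate": sum(c.completed for c in calls) / len(calls),
        "escalation_rate": sum(c.escalated for c in calls) / len(calls),
    }


calls = [
    CallRecord([700, 820, 760], completed=True, escalated=False),
    CallRecord([900, 2100], completed=False, escalated=False),
    CallRecord([650, 800, 780, 810, 790], completed=True, escalated=True),
]
report = summarize(calls)
print(report)
```

In this sample the median looks healthy at 790ms, while the p95 exposes the 2.1 second turn that preceded a hang-up. That is the reason to watch the tail and not only the median.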

Deployment considerations

A few things that catch teams off guard when moving to production:

Infrastructure region: Deploy your voice agent in the same AWS or GCP region as your telephony provider's media servers. Cross-region audio adds 50-150ms of latency that you can't fix at the application layer.

Concurrency limits: Both Vapi and Retell have default concurrency limits on free and starter plans. Hitting the concurrency limit mid-campaign means calls fail silently. Know your limits and provision accordingly before any outbound campaign.

TCPA compliance for outbound calls: If you're calling US numbers, you are subject to the Telephone Consumer Protection Act. This means you need prior express written consent for marketing calls to cell phones, clear opt-out mechanisms, and time-of-day restrictions (8am-9pm local time for the called party). This is not optional. The FCC takes violations seriously, and class action exposure is significant. Use a compliance layer or at minimum consult legal counsel before running any outbound campaign at scale.

Call recording and transcripts: Both platforms store call transcripts and optionally audio recordings. Decide your retention policy before you ship, and make sure your privacy policy reflects how call data is stored.
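Two of these items lend themselves to a guard in your own dialing code. This is a minimal sketch under stated assumptions: `within_calling_window` only checks the 8am-9pm window and expects you to supply the called party's local time, and `dial_all` caps in-flight calls with a semaphore. `fake_place_call` is a hypothetical stand-in for your platform's outbound call request. None of this is a substitute for a compliance layer or legal review.

```python
import asyncio
from datetime import datetime, timedelta, timezone


def within_calling_window(local_time: datetime) -> bool:
    """True if the called party's local time is between 8am and 9pm."""
    if local_time.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime in the called party's zone")
    return 8 <= local_time.hour < 21


async def dial_all(numbers, place_call, max_concurrent: int):
    """Run place_call for every number, never exceeding max_concurrent at once."""
    gate = asyncio.Semaphore(max_concurrent)

    async def dial(number):
        async with gate:
            return await place_call(number)

    return await asyncio.gather(*(dial(n) for n in numbers))


# Demo with a fake dialer that records peak concurrency.
stats = {"active": 0, "peak": 0}


async def fake_place_call(number: str) -> str:
    """Hypothetical stand-in for the real outbound call request."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)
    stats["active"] -= 1
    return f"dialed {number}"


numbers = [f"+1415555{i:04d}" for i in range(10)]
results = asyncio.run(dial_all(numbers, fake_place_call, max_concurrent=3))
print(f"placed {len(results)} calls, peak concurrency {stats['peak']}")

pacific = timezone(timedelta(hours=-8))  # fixed offset, for the example only
print(within_calling_window(datetime(2026, 5, 2, 20, 59, tzinfo=pacific)))
```

Set `max_concurrent` below your plan's limit so retries and inbound calls still have headroom. In real code, resolve the called party's time zone from their location and do not rely on a fixed offset.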

Where to go from here

Once you have a working agent, the next iteration is almost always improving the prompt based on real call transcripts. Read 20-30 transcripts from actual calls and find the failure patterns, which are almost always predictable: the agent says something that doesn't make sense in a specific situation, or fails to handle a common user response.

For multi-agent systems where a voice agent is one component (handing off to a scheduling agent or a CRM update agent, for example), the agent frameworks comparison guide covers how to structure those workflows. For the telephony compliance side, the AI cold calling explained guide goes deeper on TCPA and related regulations.

Voice agents in 2026 are production-ready for a growing list of use cases. The technology is not the limiting factor anymore. The limiting factor is the quality of the prompt, the tuning of the agent's behavior on real calls, and the operational work of monitoring and improving a live system.