voice-agents, developer-tools, open-source. Status: active

LiveKit Agents

Open-source framework for building real-time voice and multimodal AI agents that run in production

LiveKit Agents is an open-source Python and Node.js framework for building real-time voice and multimodal AI agents. Built on LiveKit's WebRTC infrastructure, it provides the pipeline primitives developers need to connect speech-to-text, language models, and text-to-speech into production voice agents. The framework handles turn-taking, voice activity detection, latency optimization, and scaling concerns. Developers choose the STT, LLM, and TTS provider for each stage. LiveKit Cloud provides managed infrastructure; self-hosting on your own servers is also supported.

LiveKit was building real-time communications infrastructure before the current AI voice agent wave arrived. The company's WebRTC platform already handled the hard problems of audio transport: jitter buffers, packet loss recovery, echo cancellation, and delivering low-latency audio to participants across variable network conditions.

When voice AI became a serious product category in 2024, LiveKit already had the infrastructure that voice agent developers needed. The Agents framework builds on that foundation, providing the pipeline layer that connects AI models to the real-time transport.

The result is a stack where the real-time audio engineering is handled by a team that's been doing it for years, and developers focus on the agent logic rather than audio infrastructure.

The pipeline model

A voice agent built with LiveKit Agents has three core stages: speech-to-text, language model, and text-to-speech. The framework provides pipeline primitives that connect these stages and handles the coordination between them.

Speech-to-text receives audio from the user's microphone and produces text transcription. LiveKit Agents includes plugins for Deepgram, AssemblyAI, Google, Whisper, and others. Each has different tradeoffs: Deepgram is fast and accurate for conversational English; Google handles more languages; Whisper is slower but free to run locally.

The language model receives the transcribed text, along with the conversation history and system prompt, and produces a text response. Any LLM with a standard completion API works here: Claude 3.7 Sonnet, GPT-4o, Llama, Mistral, or models running locally via Ollama.

Text-to-speech converts the model's text response to audio that gets streamed back to the user. Provider options include ElevenLabs, Cartesia, PlayHT, Google Text-to-Speech, and OpenAI TTS. Each has different latency, voice quality, and cost characteristics.

The framework handles the handoffs between stages: it detects when the user has finished speaking (voice activity detection), manages turn-taking logic to prevent the agent from interrupting mid-sentence, and streams TTS audio back as the LLM generates rather than waiting for the full response.
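The streaming handoff described above can be illustrated with a minimal, framework-free sketch. None of these names come from LiveKit Agents: `fake_stt`, `fake_llm`, `fake_tts`, `sentences`, and `run_turn` are hypothetical stand-ins that use only the Python standard library. The point is the shape of the flow: the LLM output is consumed as a stream, and each completed sentence goes to TTS without waiting for the full reply.

```python
from typing import Iterable, Iterator


def fake_stt(audio_frames: Iterable[bytes]) -> str:
    """Stand-in for an STT plugin: pretend each frame decodes to one word."""
    return " ".join(frame.decode("utf-8") for frame in audio_frames)


def fake_llm(prompt: str) -> Iterator[str]:
    """Stand-in for a streaming LLM: yields the reply one token at a time."""
    reply = f"You said: {prompt}. Anything else?"
    for token in reply.split(" "):
        yield token + " "


def sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into sentences so synthesis can start early."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()


def fake_tts(sentence: str) -> bytes:
    """Stand-in for a TTS plugin: 'synthesizes' by encoding the text."""
    return sentence.encode("utf-8")


def run_turn(audio_frames: Iterable[bytes]) -> Iterator[bytes]:
    """One user turn: STT, then streaming LLM, then sentence-by-sentence TTS."""
    transcript = fake_stt(audio_frames)
    for sentence in sentences(fake_llm(transcript)):
        yield fake_tts(sentence)


chunks = list(run_turn([b"hello", b"there"]))
for chunk in chunks:
    print(chunk)
```

Because `run_turn` is a generator, the first audio chunk is available as soon as the first sentence is complete. A real pipeline would do the same thing asynchronously and with actual provider plugins.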

Voice activity detection and interruption handling

Two of the hardest engineering problems in voice AI are detecting when the user has finished speaking and handling user interruptions of the agent.

Voice activity detection (VAD) determines when a silence means "the user has finished their turn" versus "the user paused mid-sentence." If the end-of-turn decision is too eager, the agent cuts the user off constantly. If it is too slow, there is an awkward silence before the agent responds. LiveKit Agents uses Silero VAD, a model-based approach, which is more accurate than energy-based VAD for natural conversation.
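The end-of-turn tradeoff can be made concrete with a small sketch. It assumes a model-based VAD that emits one speech probability per audio frame, and it declares the turn finished after a run of consecutive silent frames. The function `end_of_turn_frame` and its `threshold` and `min_silence_frames` parameters are hypothetical. This is not Silero's or LiveKit's API.

```python
from typing import Optional, Sequence


def end_of_turn_frame(
    speech_probs: Sequence[float],
    threshold: float = 0.5,
    min_silence_frames: int = 3,
) -> Optional[int]:
    """Return the frame index where the turn is declared finished, or None."""
    heard_speech = False
    silent_run = 0
    for i, prob in enumerate(speech_probs):
        if prob >= threshold:
            # Speech frame: the user is (still) talking, so reset the silence run.
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Silence only counts once the user has said something.
            silent_run += 1
            if silent_run >= min_silence_frames:
                return i
    return None


# Speech, a two-frame mid-sentence pause, more speech, then real silence.
probs = [0.9, 0.8, 0.1, 0.2, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]

eager = end_of_turn_frame(probs, min_silence_frames=2)    # fires during the pause
patient = end_of_turn_frame(probs, min_silence_frames=3)  # waits for the real ending
print(eager, patient)
```

With a two-frame window the detector fires inside the mid-sentence pause and cuts the user off. With a three-frame window it waits for the real ending, at the cost of one more frame of delay on every turn.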

Interruption handling determines what happens when the user speaks while the agent is still responding. LiveKit Agents pauses the TTS output and routes the new user speech through the pipeline as a new turn, maintaining natural conversation flow rather than making the user wait for the agent to finish its current response.
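As a rough illustration of the idea, interruption handling can be modeled as cancelling an in-flight playback task when user speech arrives. This sketch uses only `asyncio`. The `speak` and `demo` functions are hypothetical, and stopping playback outright (rather than pausing it) is a simplifying assumption, not a description of LiveKit's internals.

```python
import asyncio


async def speak(chunks: list[str], played: list[str]) -> None:
    """Stand-in for TTS playback: 'plays' one chunk at a time."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)  # pretend each chunk takes 50 ms to play


async def demo() -> tuple[list[str], bool]:
    played: list[str] = []
    playback = asyncio.create_task(speak(["one", "two", "three", "four"], played))

    await asyncio.sleep(0.08)  # the user starts talking partway through the reply
    playback.cancel()          # stop the agent's audio right away
    try:
        await playback
    except asyncio.CancelledError:
        pass

    # At this point the new user speech would go through the pipeline as a fresh turn.
    return played, playback.cancelled()


played, was_cancelled = asyncio.run(demo())
print(played, was_cancelled)
```

The agent never finishes its four chunks: playback ends as soon as the interruption is detected, and the remaining text is dropped.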

These are not trivial to implement correctly, and getting them wrong makes voice agents feel broken. Having them handled by the framework rather than each developer reimplementing them is a real productivity gain.

Multimodal capabilities

LiveKit is a real-time communications platform for more than just audio. Video streams, data channels, and screen sharing are all native capabilities. The Agents framework can operate in environments that combine voice with video, which opens several use cases beyond pure voice agents.

Examples include a virtual video assistant that can see what you're looking at through your camera, an agent that can view a shared screen and answer questions about what's displayed, and a customer support agent that can join a video call, share a screen, and listen and respond to audio.

These multimodal combinations are harder to build outside of LiveKit's stack because they require coordinating real-time audio and video streams simultaneously. The platform's origin in WebRTC communications makes it a natural fit for agents that go beyond voice-only.

Scaling with worker pools

Production voice agents need to handle many concurrent sessions. LiveKit Agents uses a worker pool architecture: you define an agent worker that handles a single session, and the framework dispatches incoming sessions to available workers and manages the pool.

Workers are stateless relative to each other and can be scaled horizontally. Running 50 concurrent voice sessions means running 50 worker processes, which can be distributed across multiple servers or containers. The LiveKit Cloud dispatch service handles the coordination; if you're self-hosting, you run the LiveKit SFU alongside your worker pool.

This architecture makes scaling predictable. Each worker consumes roughly constant resources for an active session. Adding capacity means adding workers. The latency profile per session doesn't degrade as you scale the number of sessions.
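The dispatch model can be sketched in a few lines of plain Python. This is a toy model of the one-session-per-worker idea described above, not LiveKit's dispatch service. `Worker`, `Pool`, `dispatch`, and `release` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Worker:
    name: str
    session: Optional[str] = None  # one session per worker, as in the text above

    @property
    def idle(self) -> bool:
        return self.session is None


@dataclass
class Pool:
    workers: list[Worker] = field(default_factory=list)

    def dispatch(self, session_id: str) -> Optional[Worker]:
        """Assign the session to the first idle worker; None means the pool is full."""
        for worker in self.workers:
            if worker.idle:
                worker.session = session_id
                return worker
        return None

    def release(self, session_id: str) -> None:
        """Free the worker that was handling this session."""
        for worker in self.workers:
            if worker.session == session_id:
                worker.session = None


pool = Pool([Worker("w1"), Worker("w2")])
a = pool.dispatch("room-a")
b = pool.dispatch("room-b")
c = pool.dispatch("room-c")         # no idle worker left, so this returns None

pool.workers.append(Worker("w3"))   # adding capacity means adding a worker
c_retry = pool.dispatch("room-c")
print(a.name, b.name, c, c_retry.name)
```

Workers share no state with each other, so the pool grows by appending workers, and a full pool is the signal to add capacity.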

LiveKit Cloud vs self-hosting

LiveKit Cloud is the managed infrastructure option. You run your agent code; LiveKit runs the SFU, handles WebRTC connection management, and provides the participant dispatch infrastructure. Pricing is around $0.006 per participant minute; for a voice agent session, that means you pay per minute of active conversation time.

Self-hosting LiveKit Server is the alternative. You run the SFU on your own infrastructure alongside your agent workers. The SFU is open-source under the Apache 2.0 license. For organizations with the infrastructure capacity, self-hosting eliminates the per-minute cost and keeps all audio traffic on your own network.

The practical decision: use LiveKit Cloud to get started quickly and to evaluate production latency before committing to infrastructure. Move to self-hosted if your volume makes the per-minute cost significant and you have the devops capacity to run it.
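To judge when the per-minute cost becomes significant, a back-of-the-envelope estimate helps. The sketch below uses the approximate $0.006 rate quoted above. The function name, the traffic numbers, and the `participants_per_session` parameter are assumptions for illustration. How many billable participants a session has depends on your plan, so check current pricing before relying on the result.

```python
RATE_PER_PARTICIPANT_MINUTE = 0.006  # USD, the approximate figure quoted in this article


def monthly_cloud_cost(
    sessions_per_day: int,
    avg_minutes: float,
    participants_per_session: int = 1,  # assumption: adjust to how your plan counts participants
    days: int = 30,
    rate: float = RATE_PER_PARTICIPANT_MINUTE,
) -> float:
    """Estimate monthly spend as participant-minutes multiplied by the per-minute rate."""
    participant_minutes = sessions_per_day * avg_minutes * participants_per_session * days
    return participant_minutes * rate


estimate = monthly_cloud_cost(sessions_per_day=1000, avg_minutes=5)
print(f"${estimate:,.2f} per month")
```

In this example, 1,000 sessions a day at 5 minutes each is 150,000 participant-minutes over 30 days, or about $900 a month at the quoted rate. Comparing that figure with what it would cost to run and operate your own SFU gives a rough break-even point.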

Comparing to managed voice platforms

Against Vapi: Vapi is a fully managed platform where you configure agents via API without writing framework code. LiveKit Agents requires writing Python or Node.js code and handling infrastructure. Vapi is faster from zero to working voice agent. LiveKit Agents is more flexible, cheaper at scale, fully open-source, and gives complete control over the STT, LLM, and TTS providers. The choice maps to whether you're willing to do engineering work for the control and cost benefits.

Against Retell AI: the comparison is similar to the one with Vapi. Retell is a managed platform and LiveKit Agents is a framework, so the developer experience differs accordingly.

Getting started

The quickstart at docs.livekit.io/agents runs a basic voice agent locally in about 20 minutes. You need Python 3.9+, a LiveKit Cloud account (free tier), and API keys for your chosen STT and TTS providers.

The minimal working example is short: define an agent class, configure the pipeline stages with your provider choices, and run the worker. Once running, you can join a LiveKit room and talk to your agent.

For production deployment, the worker pool pattern scales from the same codebase as the local development version. The main change is pointing the worker at a production LiveKit server and running workers in a process manager or container orchestrator.

The GitHub repository has active issues and discussions. The community Discord has traffic from developers actively building voice agents, which is useful for troubleshooting provider-combination and latency questions that don't have clean answers in the documentation.

Key features

Python and Node.js SDKs for building voice AI agents with real-time WebRTC transport
Pre-built pipelines for STT, LLM, and TTS with configurable providers at each stage
Multimodal support for voice plus video plus text in unified agent sessions
Plugin system supporting OpenAI Realtime API, Cartesia, Deepgram, ElevenLabs, and others
Worker pool architecture for handling many concurrent agent sessions
Voice activity detection and turn-taking logic built into the framework
Room-based session model for multi-party conversations

Pros and cons

Pros

+ Open-source Apache 2.0 license with full framework code auditable and extensible
+ WebRTC transport gives low-latency voice delivery without custom audio infrastructure
+ Pluggable provider architecture lets you swap STT, LLM, and TTS at each stage
+ Worker pool model scales to many concurrent voice sessions
+ Active development with frequent updates and responsive maintainer team
+ Supports OpenAI Realtime API for even lower latency voice pipelines

Cons

− Requires coding in Python or Node.js; no no-code builder available
− Infrastructure setup is non-trivial compared to fully managed platforms like Vapi
− Production deployment requires understanding WebRTC and real-time audio systems
− Less polished onboarding documentation compared to fully managed voice platforms
− Latency tuning requires experimentation with provider combinations

Who is LiveKit Agents for?

Building production voice agents for customer support or virtual assistants
Developers who need full control over STT, LLM, and TTS provider choices
Teams self-hosting voice AI infrastructure for data privacy or cost reasons
Multimodal agents that combine voice with video or visual understanding

Alternatives to LiveKit Agents

If LiveKit Agents isn't quite the right fit, the closest alternatives are vapi, fixie-ai, and synthflow. See our full LiveKit Agents alternatives page for side-by-side comparisons.

Frequently Asked Questions

What is LiveKit Agents?

LiveKit Agents is an open-source software framework for building real-time AI agents that communicate through voice, video, or text. It's built on top of LiveKit, which is a real-time communications platform that handles WebRTC connections, audio transport, and scaling. The Agents layer provides the pipeline abstractions for connecting speech recognition, language models, and speech synthesis into a working voice agent. You write Python or Node.js code that defines how your agent should behave, and LiveKit handles the real-time audio infrastructure underneath.

How is LiveKit Agents different from Vapi?

Vapi is a managed voice AI platform with a REST API: you configure your agent via API calls and Vapi handles everything including infrastructure, STT, LLM integration, and TTS. LiveKit Agents is a framework where you write the code yourself and either use LiveKit Cloud for infrastructure or self-host LiveKit Server. Vapi is faster to get started and requires no infrastructure knowledge. LiveKit Agents gives more control and transparency, costs less at scale for developers who can handle the infrastructure work, and is fully open-source. The choice is managed simplicity versus open-source control.

What speech-to-text providers does LiveKit Agents support?

LiveKit Agents has plugins for major STT providers including Deepgram, AssemblyAI, Google Speech-to-Text, OpenAI Whisper, and others. The plugin architecture means new providers can be added. You configure which STT provider to use in your agent code, and swapping providers typically requires changing a few lines of configuration rather than restructuring your pipeline. Deepgram is commonly used for production voice agents due to its low latency and strong accuracy on conversational speech.

Can I use LiveKit Agents with OpenAI Realtime API?

Yes. LiveKit Agents has native support for OpenAI's Realtime API, which allows voice-to-voice conversations with GPT-4o without a separate speech recognition step. The Realtime API handles audio input directly, eliminating one stage of the STT-LLM-TTS pipeline and reducing latency. For use cases where the absolute minimum latency is required and you want to use OpenAI's models, the Realtime API integration is the recommended path. For use cases where you need to use specific STT or TTS providers that aren't OpenAI's, the standard pipeline still applies.

Is LiveKit Agents production-ready?

Yes. The framework has been used in production deployments since 2024 and has handled large-scale concurrent sessions. LiveKit itself powers real-time communications for major applications and has production infrastructure at scale. The worker pool architecture handles concurrent sessions and scales horizontally. The main production readiness consideration is that it requires engineering work to deploy and operate, unlike a fully managed platform. If you have the engineering capacity, the production performance is solid. If you need a turnkey solution with minimal ops burden, a managed platform is more appropriate.
