LiveKit Agents
Open-source framework for building real-time voice and multimodal AI agents that run in production
LiveKit Agents is an open-source Python and Node.js framework for building real-time voice and multimodal AI agents. Built on LiveKit's WebRTC infrastructure, it provides the pipeline primitives developers need to connect speech-to-text (STT), language models (LLM), and text-to-speech (TTS) into production voice agents. The framework handles turn-taking, voice activity detection, latency optimization, and scaling; developers choose the STT, LLM, and TTS provider for each stage. LiveKit Cloud provides managed infrastructure, and the server can also be self-hosted.
LiveKit was building real-time communications infrastructure before the current AI voice agent wave arrived. The company's WebRTC platform already handled the hard problems of audio transport: jitter buffers, packet loss recovery, echo cancellation, and delivering low-latency audio to participants across variable network conditions.
When voice AI became a serious product category in 2024, LiveKit had infrastructure that voice agent developers needed. The Agents framework built on top of that foundation, providing the pipeline layer that connects AI models to the real-time transport.
The result is a stack where the real-time audio engineering is handled by a team that's been doing it for years, and developers focus on the agent logic rather than audio infrastructure.
The pipeline model
A voice agent built with LiveKit Agents has three core stages: speech-to-text, language model, and text-to-speech. The framework provides pipeline primitives that connect these stages and handles the coordination between them.
Speech-to-text receives audio from the user's microphone and produces text transcription. LiveKit Agents includes plugins for Deepgram, AssemblyAI, Google, Whisper, and others. Each has different tradeoffs: Deepgram is fast and accurate for conversational English; Google handles more languages; Whisper is slower but free to run locally.
The language model receives the transcribed text, along with the conversation history and system prompt, and produces a text response. Any LLM with a standard completion API works here: Claude 3.7 Sonnet, GPT-4o, Llama, Mistral, or models running locally via Ollama.
Text-to-speech converts the model's text response to audio that gets streamed back to the user. Provider options include ElevenLabs, Cartesia, PlayHT, Google Text-to-Speech, and OpenAI TTS. Each has different latency, voice quality, and cost characteristics.
The framework handles the handoffs between stages: it detects when the user has finished speaking (voice activity detection), manages turn-taking logic to prevent the agent from interrupting mid-sentence, and streams TTS audio back as the LLM generates rather than waiting for the full response.
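To make the streaming handoff concrete, here is a minimal sketch in plain Python of passing LLM output to TTS sentence by sentence instead of waiting for the full response. It does not use the LiveKit Agents API: `fake_llm`, `fake_tts`, `sentences`, and `respond` are hypothetical stand-ins invented for illustration, and a real pipeline would also stream audio frames rather than return bytes.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_llm(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM: yields a canned reply token by token."""
    for token in ["Sure. ", "Your ", "order ", "shipped ", "today. ", "Anything ", "else?"]:
        await asyncio.sleep(0)  # yield to the event loop, as a network read would
        yield token


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed tokens into sentences so synthesis can start early."""
    buf = ""
    async for tok in tokens:
        buf += tok
        if buf.rstrip().endswith((".", "?", "!")):
            yield buf.strip()
            buf = ""
    if buf.strip():  # flush any trailing text without end punctuation
        yield buf.strip()


async def fake_tts(text: str) -> bytes:
    """Hypothetical stand-in for TTS: returns placeholder 'audio' bytes."""
    return text.encode("utf-8")


async def respond(transcript: str) -> List[bytes]:
    """Synthesize each sentence as soon as the LLM has finished it."""
    chunks = []
    async for sentence in sentences(fake_llm(transcript)):
        chunks.append(await fake_tts(sentence))
    return chunks
```

Calling `asyncio.run(respond("where is my order"))` produces one audio chunk per sentence, so the first chunk is ready before the model has generated the rest of the reply.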
Voice activity detection and interruption handling
Two of the hardest engineering problems in voice AI are detecting when the user has finished speaking and handling user interruptions of the agent.
Voice activity detection (VAD) determines whether a silence means "the user has finished their turn" or "the user paused mid-sentence." If end-of-turn detection is too sensitive, the agent constantly cuts the user off. If it is not sensitive enough, the user sits through an awkward silence before the agent responds. LiveKit Agents uses Silero VAD, a model-based approach, which is more accurate than energy-based VAD for natural conversation.
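The sensitivity tradeoff can be illustrated with a small endpointing sketch. It assumes a VAD model has already produced one speech probability per audio frame; the function name `end_of_turn_frame` and its parameters (`frame_ms`, `threshold`, `min_silence_ms`) are hypothetical, and this is not how LiveKit Agents or Silero implement end-of-turn detection internally.

```python
from typing import List, Optional


def end_of_turn_frame(
    speech_probs: List[float],
    frame_ms: int = 30,
    threshold: float = 0.5,
    min_silence_ms: int = 500,
) -> Optional[int]:
    """Return the index of the frame where the turn is declared finished, or None.

    The turn ends once speech has been heard and is followed by at least
    min_silence_ms of consecutive frames below the speech threshold.
    """
    needed = -(-min_silence_ms // frame_ms)  # ceiling division: frames of silence required
    heard_speech = False
    silent = 0
    for i, p in enumerate(speech_probs):
        if p >= threshold:
            heard_speech = True
            silent = 0  # any speech resets the silence counter
        elif heard_speech:
            silent += 1
            if silent >= needed:
                return i
    return None


# Illustrative input: speech, a 300 ms mid-sentence pause, more speech, then 900 ms of silence.
probs = [0.9] * 10 + [0.1] * 10 + [0.9] * 10 + [0.1] * 30
```

With these illustrative probabilities, a 200 ms silence window ends the turn inside the mid-sentence pause, a 500 ms window waits for the real end of the utterance, and a 1200 ms window never fires within the sample.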
Interruption handling determines what happens when the user speaks while the agent is still responding. LiveKit Agents pauses the TTS output and routes the new user speech through the pipeline as a new turn, maintaining natural conversation flow rather than making the user wait for the agent to finish its current response.
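A rough sketch of this barge-in behavior using only `asyncio`: playback runs as a task, and it is cancelled as soon as the user is detected speaking. The names `play_reply` and `run_turn` are hypothetical, and the `interrupt_after` counter is a stand-in for a real VAD signal; the framework's actual mechanism is not shown here.

```python
import asyncio
from typing import List, Tuple


async def play_reply(sentences: List[str], played: List[str]) -> None:
    """Stand-in for TTS playback: 'plays' one sentence per event-loop turn."""
    for s in sentences:
        played.append(s)
        await asyncio.sleep(0)  # yield control, as real audio playback would


async def run_turn(sentences: List[str], interrupt_after: int) -> Tuple[List[str], bool]:
    """Play a reply, cancelling it once the user 'speaks' (after N sentences)."""
    played: List[str] = []
    playback = asyncio.create_task(play_reply(sentences, played))

    # Stand-in for VAD: wait until the simulated user starts talking.
    while len(played) < interrupt_after and not playback.done():
        await asyncio.sleep(0)

    interrupted = not playback.done()
    if interrupted:
        playback.cancel()  # stop agent audio immediately
        try:
            await playback
        except asyncio.CancelledError:
            pass  # expected: playback was cut short
        # A real agent would now route the new user speech through the pipeline.
    return played, interrupted
```

Running `asyncio.run(run_turn(["One.", "Two.", "Three.", "Four."], interrupt_after=2))` stops playback after the second sentence, so the remaining sentences are never played.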
These are not trivial to implement correctly, and getting them wrong makes voice agents feel broken. Having them handled by the framework rather than each developer reimplementing them is a real productivity gain.
Multimodal capabilities
LiveKit is a real-time communications platform for more than just audio. Video streams, data channels, and screen sharing are all native capabilities. The Agents framework can operate in environments that combine voice with video, which opens several use cases beyond pure voice agents.
Examples include a virtual video assistant that can see what you're looking at through your camera, an agent that can view a shared screen and answer questions about what's displayed, and a customer support agent that can join a video call, share a screen, and listen and respond to audio.
These multimodal combinations are harder to build outside of LiveKit's stack because they require coordinating real-time audio and video streams simultaneously. The platform's origin in WebRTC communications makes it a natural fit for agents that go beyond voice-only.
Scaling with worker pools
Production voice agents need to handle many concurrent sessions. LiveKit Agents uses a worker pool architecture: you define an agent worker that handles a single session, and the framework dispatches incoming sessions to available workers and manages the pool.
Workers are stateless relative to each other and can be scaled horizontally. Each active session runs in its own process, so 50 concurrent voice sessions means 50 session processes, which can be distributed across multiple servers or containers. The LiveKit Cloud dispatch service handles the coordination; if you're self-hosting, you run the LiveKit SFU alongside your worker pool.
This architecture makes scaling predictable. Each worker consumes roughly constant resources for an active session. Adding capacity means adding workers. The latency profile per session doesn't degrade as you scale the number of sessions.
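As an illustration of why capacity grows linearly with workers, here is a toy dispatcher that assigns each new session to the worker with the most free slots. The `Worker` class, its `capacity` field, and the `dispatch` function are hypothetical; this is a sketch of the general pattern, not LiveKit's dispatch algorithm.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Worker:
    """Hypothetical worker record: a name and how many sessions it can hold."""
    name: str
    capacity: int
    sessions: List[str] = field(default_factory=list)

    @property
    def free(self) -> int:
        return self.capacity - len(self.sessions)


def dispatch(workers: List[Worker], session_id: str) -> Optional[str]:
    """Assign the session to the worker with the most free slots.

    Returns the chosen worker's name, or None when the pool is full,
    which is the signal to add more workers.
    """
    candidates = [w for w in workers if w.free > 0]
    if not candidates:
        return None
    target = max(candidates, key=lambda w: w.free)  # ties go to the first worker listed
    target.sessions.append(session_id)
    return target.name
```

When `dispatch` returns `None`, the pool is saturated; adding another `Worker` to the list restores capacity without touching existing sessions.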
LiveKit Cloud vs self-hosting
LiveKit Cloud is the managed infrastructure option. You run your agent code; LiveKit runs the SFU, handles WebRTC connection management, and provides the participant dispatch infrastructure. Pricing is around $0.006 per participant minute; for a voice agent session, that is billed for each minute of active conversation.
Self-hosting LiveKit Server is the alternative. You run the SFU on your own infrastructure alongside your agent workers. The SFU is open-source under the Apache 2.0 license. For organizations with the infrastructure capacity, self-hosting eliminates the per-minute cost and keeps all audio traffic on your own network.
The practical decision: use LiveKit Cloud to get started quickly and to evaluate production latency before committing to infrastructure. Move to self-hosted if your volume makes the per-minute cost significant and you have the devops capacity to run it.
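The volume threshold can be estimated with simple arithmetic. The sketch below uses the approximate $0.006 per-minute rate quoted above; the self-hosting cost is a placeholder you would replace with your own servers, bandwidth, and operations estimate, and both function names are hypothetical.

```python
def cloud_cost(minutes: float, rate_per_minute: float = 0.006) -> float:
    """Monthly managed cost for a given number of billed minutes."""
    return minutes * rate_per_minute


def break_even_minutes(self_hosted_monthly_cost: float, rate_per_minute: float = 0.006) -> float:
    """Billed minutes per month at which self-hosting costs the same as the managed rate.

    self_hosted_monthly_cost is an assumption supplied by the caller
    (servers, bandwidth, and engineering time), not a published figure.
    """
    return self_hosted_monthly_cost / rate_per_minute
```

For example, if self-hosting were estimated at $600 per month, `break_even_minutes(600)` comes to about 100,000 billed minutes per month; below that volume the managed option is cheaper on these assumptions.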
Comparing to managed voice platforms
Against Vapi: Vapi is a fully managed platform where you configure agents via API without writing framework code. LiveKit Agents requires writing Python or Node.js code and handling infrastructure. Vapi is faster from zero to working voice agent. LiveKit Agents is more flexible, cheaper at scale, fully open-source, and gives complete control over the STT, LLM, and TTS providers. The choice maps to whether you're willing to do engineering work for the control and cost benefits.
Against Retell AI: the comparison is similar to Vapi. Retell is a managed platform and LiveKit Agents is a framework, so the same tradeoff between setup speed and control applies.
Getting started
The quickstart at docs.livekit.io/agents runs a basic voice agent locally in about 20 minutes. You need Python 3.9+, a LiveKit Cloud account (free tier), and API keys for your chosen STT and TTS providers.
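Before starting a worker, it can help to check that credentials are present. The sketch below assumes the worker reads its LiveKit connection settings from the `LIVEKIT_URL`, `LIVEKIT_API_KEY`, and `LIVEKIT_API_SECRET` environment variables; the provider variable names in the usage note are examples only and depend on which STT and TTS plugins you choose. The helper `missing_settings` is hypothetical.

```python
import os
from typing import Iterable, List, Mapping

# Connection settings assumed to be read from the environment.
LIVEKIT_VARS = ("LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET")


def missing_settings(env: Mapping[str, str], provider_vars: Iterable[str] = ()) -> List[str]:
    """Return the names of required settings that are unset or empty."""
    required = list(LIVEKIT_VARS) + list(provider_vars)
    return [name for name in required if not env.get(name)]
```

A typical call would be `missing_settings(os.environ, ["DEEPGRAM_API_KEY", "CARTESIA_API_KEY"])`, with the provider names swapped for whichever services you configured; an empty list means every required value is set.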
The minimal working example is short: define an agent class, configure the pipeline stages with your provider choices, and run the worker. Once running, you can join a LiveKit room and talk to your agent.
For production deployment, the worker pool pattern scales from the same codebase as the local development version. The main change is pointing the worker at a production LiveKit server and running workers in a process manager or container orchestrator.
The GitHub repository has active issues and discussions. The community Discord has traffic from developers actively building voice agents, which is useful for troubleshooting the provider combination and latency questions that don't have clean answers in documentation.
Key features
- Python and Node.js SDKs for building voice AI agents with real-time WebRTC transport
- Pre-built pipelines for STT, LLM, and TTS with configurable providers at each stage
- Multimodal support for voice plus video plus text in unified agent sessions
- Plugin system supporting OpenAI Realtime API, Cartesia, Deepgram, ElevenLabs, and others
- Worker pool architecture for handling many concurrent agent sessions
- Voice activity detection and turn-taking logic built into the framework
- Room-based session model for multi-party conversations
Pros and cons
Pros
- Open-source under the Apache 2.0 license, with full framework code auditable and extensible
- WebRTC transport gives low-latency voice delivery without custom audio infrastructure
- Pluggable provider architecture lets you swap STT, LLM, and TTS at each stage
- Worker pool model scales to many concurrent voice sessions
- Active development with frequent updates and responsive maintainer team
- Supports OpenAI Realtime API for even lower latency voice pipelines
Cons
- Requires coding in Python or Node.js; no no-code builder available
- Infrastructure setup is non-trivial compared to fully managed platforms like Vapi
- Production deployment requires understanding WebRTC and real-time audio systems
- Less polished onboarding documentation compared to fully managed voice platforms
- Latency tuning requires experimentation with provider combinations
Who is LiveKit Agents for?
- Building production voice agents for customer support or virtual assistants
- Developers who need full control over STT, LLM, and TTS provider choices
- Teams self-hosting voice AI infrastructure for data privacy or cost reasons
- Multimodal agents that combine voice with video or visual understanding
Alternatives to LiveKit Agents
If LiveKit Agents isn't quite the right fit, the closest alternatives are vapi, fixie-ai, and synthflow. See our full LiveKit Agents alternatives page for side-by-side comparisons.