speech-to-text, voice-agents, api. Status: active

Deepgram

Speech-to-text API and voice agent platform built for real-time low-latency applications

Deepgram is a speech-to-text API and voice agent platform built specifically for real-time, low-latency workloads. Its Nova-3 model handles transcription at $0.0043 per minute with accuracy that matches or beats larger providers on most audio types. Aura provides TTS for the response side of voice agent pipelines, and the Voice Agent API combines both into an end-to-end platform. The free tier includes $200 in credits. It's the default choice when latency is a hard constraint.

Deepgram is eleven years old, which makes it ancient by AI startup standards. It was doing neural speech recognition before the transformer era, before OpenAI released Whisper, before the current wave of speech-to-text providers entered the market. That history matters because Deepgram's Nova-3 model reflects a decade of production optimization on real-world audio, not just research benchmark performance.

The company is best understood as the speech infrastructure layer for voice applications. If you're building something where audio goes in and you need a reliable, fast transcription out, Deepgram is the starting point for evaluation.

What Deepgram actually does

The core product is speech-to-text. You send audio, you get back a transcript with word-level timestamps. That description undersells it the way saying "a database stores data" undersells PostgreSQL. The details matter.

Nova-3 is the current flagship transcription model. The meaningful specs are: word error rate competitive with or better than OpenAI Whisper Large v3 on most English audio types, real-time streaming with under 300ms latency from audio receipt to transcript delivery, and support for 37 languages. Speaker diarization, custom vocabulary, and keyword boosting all work on the same model.

The latency point deserves emphasis. Whisper, which became the open-source transcription benchmark, was designed for batch processing. You feed it an audio file, it processes the whole thing and returns a transcript. Running Whisper for real-time transcription requires either chunking audio, which introduces errors at boundaries, or streaming with a specialized setup, which adds latency and complexity. Nova-3 is designed from the ground up for streaming. The difference in real-time accuracy and latency between them in a production voice pipeline is significant.
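To make the boundary problem concrete, here is a minimal sketch in Python. The word timings and the 5-second chunk length are invented for illustration, and `words_cut_by_chunking` is a hypothetical helper, not part of Whisper or any Deepgram SDK. It lists the words that a fixed-size chunker would slice in two, which are the words a batch model is most likely to get wrong when run chunk by chunk.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_sec, end_sec)


def words_cut_by_chunking(words: List[Word], chunk_sec: float) -> List[str]:
    """Return the words whose audio straddles a fixed chunk boundary."""
    cut = []
    for text, start, end in words:
        # A word is split when its start and end land in different chunks.
        # A word ending exactly on a boundary is treated as intact.
        first_chunk = int(start // chunk_sec)
        last_chunk = int((end - 1e-9) // chunk_sec)
        if last_chunk > first_chunk:
            cut.append(text)
    return cut


# Invented timings, for illustration only.
words = [
    ("please", 3.9, 4.4),
    ("reschedule", 4.6, 5.3),    # straddles the 5.0 s boundary
    ("my", 5.4, 5.6),
    ("appointment", 9.7, 10.4),  # straddles the 10.0 s boundary
]
print(words_cut_by_chunking(words, chunk_sec=5.0))
```

Overlapping chunks or cutting on silence reduces the number of split words, but both add buffering, which is where the extra latency in a chunked setup comes from.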

Aura is Deepgram's TTS product, designed for the response side of voice agent pipelines. The design goal is low latency, not the maximum voice quality that a product like ElevenLabs optimizes for. Aura produces speech fast enough that end-to-end voice agent response time stays under a second. The trade-off is naturalness: Aura sounds functional and clear but not as natural as ElevenLabs voices. For voice agents where the conversation mechanics matter more than the voice quality, that trade-off is usually acceptable.

The Voice Agent API combines Nova-3, an LLM integration layer, and Aura into a single websocket-based pipeline for real-time conversational agents. You connect, configure your LLM endpoint and agent persona, and the system handles the transcription, reasoning, and synthesis loop. End-to-end latency from end of user utterance to start of agent speech is typically under 700ms.

Audio intelligence add-ons layer analytical capabilities on top of transcription: sentiment analysis, topic detection, entity recognition, PII detection and redaction, and summarization. These are available as parameters on transcription requests.
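As a rough sketch of what "parameters on transcription requests" can look like, the snippet below builds a request URL with several add-ons switched on. The endpoint and the parameter names (`sentiment`, `topics`, `summarize`, `redact`) are assumptions to verify against the current API reference, and `build_listen_url` is a hypothetical helper. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the current API reference before use.
BASE_URL = "https://api.deepgram.com/v1/listen"


def build_listen_url(**features: str) -> str:
    """Build a transcription URL with add-ons enabled as query parameters."""
    params = {"model": "nova-3", **features}
    return f"{BASE_URL}?{urlencode(params)}"


# Parameter names and values here are assumptions, not confirmed by this review.
url = build_listen_url(sentiment="true", topics="true", summarize="v2", redact="pii")
print(url)
```

The point of the design is that analysis rides along with the transcription call, so there is no second service to call once the transcript comes back.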

Nova-3 accuracy in the real world

Benchmark accuracy numbers from controlled research audio don't always translate to production performance on real phone calls, web conference recordings, or in-person meeting audio. Deepgram publishes extensive benchmarks, but the more informative signal is how Nova-3 performs on the audio types that production voice applications actually encounter.

On telephone audio, where the 8 kHz sampling rate of traditional phone lines leaves less than 4 kHz of usable bandwidth for any transcription system to work with, Nova-3 is one of the strongest performers. This matters for call center and customer service applications where the audio input is constrained.

On accented English, Nova-3 performs well across a range of accents. This is partly a function of training data breadth and partly a decade of production feedback driving model improvements in the areas where customer errors occurred most.

On technical, medical, and specialized vocabulary, the custom vocabulary feature is the main tool. You can add domain-specific terms and their phonetic representations, and boost their likelihood in the decoding process. This is important for applications where standard vocabulary handling misses terminology that matters for the use case.
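A small sketch of passing domain terms with a request follows. It assumes that terms are supplied as a repeated `keyterm` query parameter, which should be checked against the current documentation since the parameter name and behavior can differ by model. `build_keyterm_query` is a hypothetical helper, and the medical terms are examples only.

```python
from typing import List
from urllib.parse import urlencode


def build_keyterm_query(terms: List[str]) -> str:
    """Encode one `keyterm` entry per domain term (assumed parameter name)."""
    pairs = [("model", "nova-3")] + [("keyterm", term) for term in terms]
    # urlencode escapes spaces and punctuation inside multi-word terms.
    return urlencode(pairs)


print(build_keyterm_query(["metoprolol", "atrial fibrillation"]))
```

Keeping the term list in your own configuration, rather than hard-coding it in request code, makes it easier to test whether each added term actually improves recognition on your audio.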

On noisy environments, Nova-3 is solid but not magical. Significant background noise degrades any transcription system. The robustness is better than earlier Deepgram models and competitive with current alternatives, but it has limits.

The Voice Agent platform

The Voice Agent API is Deepgram's expansion from a transcription API into a full conversational voice platform. The timing is right: as voice agents have become a mainstream application category, customers who start with Deepgram for transcription naturally want to extend to the full pipeline.

The architecture is websocket-based, which is correct for real-time voice. You maintain a persistent connection, stream audio in, and receive transcription and agent response audio back. Configuration includes:

Agent persona definition via a system prompt for the LLM layer. Deepgram integrates with OpenAI, Anthropic, and other LLM providers, or you can point it at your own model endpoint.

Function calling for external data access, so the agent can look up information, trigger actions, or connect to your backend systems.

Voice configuration selecting which Aura voice the agent uses for responses.

Turn-taking sensitivity settings that control how aggressively the system detects end of utterance and begins processing.
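The four configuration areas above can be pictured as a single settings payload. The sketch below is illustrative only: every key name, the `build_agent_settings` helper, the function definition, and the voice identifier are hypothetical and do not mirror Deepgram's actual message schema, which should be taken from the Voice Agent API documentation.

```python
import json
from typing import Any, Dict, List, Optional


def build_agent_settings(
    prompt: str,
    llm_provider: str,
    voice: str,
    end_of_turn_sensitivity: float,
    functions: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Assemble persona, functions, voice, and turn-taking into one payload.

    All keys are hypothetical and chosen for readability.
    """
    if not 0.0 <= end_of_turn_sensitivity <= 1.0:
        raise ValueError("end_of_turn_sensitivity must be between 0 and 1")
    return {
        "persona": {"system_prompt": prompt, "llm_provider": llm_provider},
        "functions": functions or [],
        "voice": voice,
        "turn_taking": {"end_of_turn_sensitivity": end_of_turn_sensitivity},
    }


settings = build_agent_settings(
    prompt="You are a concise scheduling assistant.",
    llm_provider="openai",
    voice="example-aura-voice",  # placeholder, not a real voice ID
    end_of_turn_sensitivity=0.5,
    functions=[{"name": "lookup_appointment", "parameters": {"customer_id": "string"}}],
)
# In this sketch the payload would be serialized and sent over the open websocket.
message = json.dumps(settings)
print(message)
```

Validating the payload before the connection opens keeps configuration mistakes out of the live audio path, where they are harder to debug.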

The end-to-end latency of under 700ms is real in production conditions, not just controlled tests. That's fast enough that the conversation doesn't feel artificially delayed. For voice agents where naturalness of conversation pacing is important, this is a meaningful engineering advantage over solutions built by stitching separate STT, LLM, and TTS APIs together with HTTP calls.
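The arithmetic behind that comparison is a simple latency budget: each stage and each extra network hop spends part of the 700ms. The sketch below shows the calculation only. Every per-stage number is a placeholder chosen to illustrate the sum, not a measurement of Deepgram or any other provider, and the stage names are hypothetical.

```python
from typing import Dict

BUDGET_MS = 700  # end-to-end target quoted in this review


def total_latency_ms(stages: Dict[str, int]) -> int:
    """Sum per-stage latencies for one conversational turn."""
    return sum(stages.values())


def within_budget(stages: Dict[str, int], budget_ms: int = BUDGET_MS) -> bool:
    """Report whether the turn fits the end-to-end budget."""
    return total_latency_ms(stages) <= budget_ms


# Placeholder values: three separate services joined by HTTP calls.
stitched = {
    "stt_final_transcript": 300,
    "http_hop_to_llm": 80,
    "llm_first_token": 250,
    "http_hop_to_tts": 80,
    "tts_first_audio": 150,
}
# Placeholder values: the same stages without the extra hops.
single_pipeline = {k: v for k, v in stitched.items() if not k.startswith("http_hop")}

print(total_latency_ms(stitched), within_budget(stitched))
print(total_latency_ms(single_pipeline), within_budget(single_pipeline))
```

Replacing the placeholders with timings measured on your own stack shows quickly whether the hops between services or the stages themselves are what breaks the budget.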

The developer experience

Deepgram's SDKs are a genuine strength. Python, JavaScript/TypeScript, Go, .NET, and Rust are all first-party maintained, not community ports. The Python and JavaScript SDKs in particular are well-typed and follow idiomatic patterns for their languages.

A basic real-time transcription stream in Python:

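import os
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

# Print each transcript as it arrives (handler signature follows the SDK's v3 examples)
def on_message(self, result, **kwargs):
    print(result.channel.alternatives[0].transcript)

# Read the API key from the environment instead of an undefined variable
deepgram = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])
connection = deepgram.listen.websocket.v("1")

connection.on(LiveTranscriptionEvents.Transcript, on_message)
connection.start(LiveOptions(model="nova-3", language="en-US"))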

The API design is consistent across transcription, TTS, and the Voice Agent API. Documentation is solid for the core use cases. The gaps show up on edge cases and more recent features where documentation hasn't caught up with the implementation.

The $200 free credit on signup is genuinely useful for evaluation. At $0.0043 per minute, $200 covers about 775 hours of transcription, which is more than enough to run real benchmarks against your actual audio data before committing to a paid plan.
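The credit-to-hours figure is easy to check. This short calculation uses only the list price and credit amount quoted in this review; `credit_hours` is a helper named here for illustration.

```python
PRICE_PER_MINUTE_USD = 0.0043  # Nova-3 list price quoted above
FREE_CREDIT_USD = 200.0


def credit_hours(credit_usd: float, price_per_minute: float) -> float:
    """Convert a credit balance into hours of audio at a per-minute price."""
    minutes = credit_usd / price_per_minute
    return minutes / 60


print(round(credit_hours(FREE_CREDIT_USD, PRICE_PER_MINUTE_USD)))  # 775
```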

Pricing: the math at scale

$0.0043 per minute for Nova-3 is approximately $0.26 per hour of audio. For a call center handling 10,000 call hours per month, that's roughly $2,600 per month at list price. Volume discounts on the Growth plan and enterprise negotiations bring that down meaningfully at scale.

The comparison against alternatives is favorable. AssemblyAI at $0.0045 per minute for their Universal-1 model is close. Google Speech-to-Text at $0.016 per minute for enhanced models is significantly higher. AWS Transcribe is comparable to Google at standard pricing. OpenAI Whisper via the API is $0.006 per minute.
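To put those per-minute rates on the same footing as the call center example, the sketch below scales each one to 10,000 audio hours per month. The prices are the list prices quoted in this section and will drift over time. AWS Transcribe is left out because no figure is given for it here.

```python
# Per-minute list prices as quoted in this section.
PRICES_PER_MINUTE_USD = {
    "Deepgram Nova-3": 0.0043,
    "AssemblyAI Universal-1": 0.0045,
    "OpenAI Whisper API": 0.006,
    "Google Speech-to-Text (enhanced)": 0.016,
}


def monthly_cost_usd(price_per_minute: float, audio_hours: float) -> float:
    """List-price cost for a month of audio, before any volume discount."""
    return price_per_minute * 60 * audio_hours


costs = {
    name: monthly_cost_usd(price, audio_hours=10_000)
    for name, price in PRICES_PER_MINUTE_USD.items()
}
# Print cheapest first.
for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:,.0f}")
```

At this volume the gap between the two cheapest options is small compared with the gap to the most expensive one, so latency and accuracy on your own audio are likely to decide between the closest competitors.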

At the numbers that matter for production workloads, Deepgram's pricing is competitive, particularly when you factor in the latency advantage for real-time applications where alternatives may require more expensive streaming-optimized configurations.

For the Voice Agent API, pricing includes both the STT and TTS components. Check the current pricing page because the Voice Agent API pricing structure has evolved as the product matured.

Where Deepgram fits against competitors

Deepgram vs AssemblyAI: Deepgram wins on raw STT latency and throughput for real-time streaming. AssemblyAI wins on audio intelligence features, particularly LeMUR for LLM-powered analysis of audio content. For pure voice agent transcription, Deepgram. For applications where you want to analyze call content in depth, AssemblyAI.

Deepgram vs ElevenLabs: These aren't really direct competitors. ElevenLabs is primarily a TTS and voice agent platform. Deepgram is primarily an STT platform that has added TTS. If you want the best voice output quality, ElevenLabs. If you want the best real-time transcription, Deepgram. For voice agents that need both, the choice often involves using Deepgram's STT with ElevenLabs' TTS through the Voice Agent API's custom TTS configuration, or accepting Aura's quality trade-off for the latency benefit.

Deepgram vs OpenAI Whisper (self-hosted): Running Whisper yourself is cheaper at sufficient scale but requires GPU infrastructure, model management, and real-time streaming optimizations that are non-trivial engineering work. Deepgram's managed API is worth the per-minute cost for most teams until you're large enough that the infrastructure investment clearly pays off.

Who should use Deepgram

Teams building real-time voice applications where latency is a hard constraint. If your voice agent needs to respond in under a second, and most real-time voice experiences do, Deepgram's STT latency makes it the default choice to evaluate first.

Call center and customer service applications that need accurate transcription of telephone audio at scale and want to keep infrastructure complexity minimal.

Developers who want a single API for the full voice agent pipeline and are willing to accept Aura's voice quality level. The Voice Agent API reduces the number of services to integrate and monitor in production.

Applications with specialized vocabulary requirements where custom vocabulary and keyword boosting are important. Medical, legal, technical, and financial audio all have terminology that generic models handle inconsistently.

The bottom line

Deepgram has been doing production speech-to-text for eleven years and Nova-3 reflects that experience. The latency, the real-world accuracy, and the SDK quality are all at a level where it's the default starting point for real-time transcription applications in 2026. Aura TTS is functional but not the quality leader on the synthesis side, and ElevenLabs remains the better choice if voice output quality is a priority. The Voice Agent API is a compelling complete package for teams that want one vendor for the full pipeline. At $0.0043 per minute with $200 in free credits, the evaluation economics are straightforward.

Key features

Nova-3 model for transcription with best-in-class accuracy on English and 36 other languages
Real-time streaming transcription with word-level timestamps and under 300ms latency
Aura TTS for low-latency text-to-speech optimized for voice agent pipelines
Speaker diarization for multi-speaker audio separation
Deepgram Voice Agent API for end-to-end voice agent deployment
Custom vocabulary and keyword boosting for domain-specific audio
Sentiment analysis, topic detection, and summarization add-ons
Whisper model access via the same API for broader language support

Pros and cons

Pros

+ Industry-leading real-time transcription latency, under 300ms on most audio
+ Nova-3 accuracy competitive with or better than OpenAI Whisper and Google STT on English
+ Pricing at $0.0043 per minute is among the most competitive for high-accuracy models
+ SDKs available in Python, JavaScript, Go, .NET, and Rust with consistent quality
+ Voice Agent API covers the full pipeline without stitching separate services
+ Custom vocabulary and keyword boosting for technical, medical, and specialized content
+ Consistent latency under real-world network conditions, not just controlled benchmarks

Cons

− Language support outside English is strong but not as deep as some competitors
− Aura TTS voice quality is functional but not at ElevenLabs level for naturalness
− Sentiment and topic detection add-ons are basic compared to dedicated audio intelligence platforms
− No web studio interface for non-developers; everything goes through the API
− Documentation quality is good but inconsistently updated across SDK versions
− Enterprise support response times can lag during growth periods

Who is Deepgram for?

Real-time transcription for customer service calls, sales calls, and meeting recording
Voice agent pipelines where end-to-end latency under one second is required
Captioning and live transcription for video streaming and broadcasting
Transcription infrastructure for medical, legal, and technical audio with custom vocabulary

Alternatives to Deepgram

If Deepgram isn't quite the right fit, the closest alternatives are AssemblyAI and ElevenLabs. See our full Deepgram alternatives page for side-by-side comparisons.

Frequently Asked Questions

What is Deepgram?

Deepgram is a speech-to-text and voice agent platform that specializes in real-time, low-latency audio processing. Their Nova-3 model handles transcription for production voice applications where speed and accuracy both matter. The platform also includes Aura for TTS and a Voice Agent API that combines transcription, LLM reasoning, and speech synthesis in a single pipeline. It's built as an API-first product aimed at developers, with SDKs for most major languages.

How much does Deepgram cost?

Nova-3 transcription starts at $0.0043 per minute on Pay-As-You-Go, which is roughly $0.26 per hour of audio. New accounts get $200 in free credits. Aura TTS is priced separately per character. Volume pricing applies at higher usage tiers on the Growth plan. Enterprise contracts are negotiated and include SLAs, dedicated infrastructure, and custom per-unit pricing. Whisper model access through the Deepgram API is available at different pricing. Check the current pricing page for the latest rates since tier structures update periodically.

What is Nova-3 and how accurate is it?

Nova-3 is Deepgram's latest speech-to-text model as of mid-2026. It achieves word error rates competitive with OpenAI Whisper Large v3 on English audio and outperforms it on many real-world audio conditions including telephone audio, accented speech, and noisy environments. Nova-3 is specifically optimized for real-time streaming, which is where the comparison against Whisper is most relevant, since Whisper was originally designed for batch processing. Word-level timestamps, speaker diarization, and keyword boosting all work with Nova-3.

How does Deepgram compare to AssemblyAI?

Deepgram and AssemblyAI are both speech-to-text API platforms with real-time capabilities, but they're optimized differently. Deepgram prioritizes raw latency and throughput for voice agent and live transcription applications. AssemblyAI prioritizes audio intelligence features like LeMUR for LLM-powered analysis of transcripts, richer entity detection, and PII redaction. If your primary need is fast, accurate, real-time transcription for a voice pipeline, Deepgram has the edge on latency. If you need deeper analytical processing on top of the transcript, AssemblyAI's feature set is broader. Pricing is roughly comparable for baseline transcription.

What is the Deepgram Voice Agent API?

The Voice Agent API is Deepgram's end-to-end platform for building real-time voice agents. It combines Nova-3 for speech-to-text, an LLM integration layer for reasoning, and Aura TTS for the response, all in a single websocket connection. You configure the agent's persona, connect your LLM of choice, and the pipeline handles the real-time audio processing. The end-to-end latency from end of user speech to start of agent response is typically under 700ms, which is competitive with purpose-built voice agent platforms.

Related agents

Air AI

AI sales agent for extended outbound phone conversations up to 40 minutes focused on appointment setting

voice-agents, sales. From $99/mo

Anthropic Computer Use

Claude's computer-use capability that powers desktop and browser agents

autonomous, computer-use. Paid

AssemblyAI

Speech-to-text API and audio intelligence platform with LLM-powered analysis via LeMUR

speech-to-text, audio-intelligence. Free tier
