speech-to-text, audio-intelligence, api | Status: active

AssemblyAI

Speech-to-text API and audio intelligence platform with LLM-powered analysis via LeMUR

AssemblyAI is a speech-to-text API and audio intelligence platform that goes beyond transcription with LeMUR, their LLM-powered layer for analyzing audio content. Universal-1 at $0.0045 per minute handles production transcription. Universal-2 at $1.65 per hour delivers the highest accuracy tier. LeMUR lets you ask questions, generate summaries, and run custom analysis against any transcript. Real-time and async APIs cover the full range of deployment patterns. SDKs for 6 languages make integration straightforward.

AssemblyAI started with a simple thesis: developers needed a speech-to-text API that actually worked. When they launched in 2017, the options were Google, Amazon, and Microsoft, all of which worked reasonably well but required you to integrate with their broader cloud ecosystems and accept the limitations of those general-purpose platforms. AssemblyAI built a dedicated service.

What's more interesting is what they built on top of transcription. LeMUR, the LLM-powered analysis layer, is the feature that distinguishes AssemblyAI from the rest of the transcription API market in 2026 and the reason developers doing audio intelligence work tend to land here.

What the platform provides

The product has two logical layers. The first is speech-to-text across four models. The second is audio intelligence, which is where the differentiation lives.

Transcription models cover different cost-accuracy points:

Universal-2 is the accuracy-optimized flagship for English. It's the model to use when transcript quality directly affects the downstream output quality and the cost premium is justified. At $1.65 per hour, it's expensive relative to baseline options but produces accuracy competitive with the best available models on clear speech and handles accented speech, telephone audio, and noisy environments better than its predecessor.

Universal-1 is the production workhorse at $0.0045 per minute. For the vast majority of business audio, this model is accurate enough and the economics at scale make more sense than Universal-2 unless you have specific accuracy requirements.

Best is the throughput-optimized model at $0.0036 per minute. It is slightly less accurate than Universal-1 but meaningfully faster for high-volume batch processing.

Nano is the lightweight model at $0.0005 per minute for simple, clear speech where cost is the primary concern.

All models support real-time streaming, speaker diarization, word-level timestamps, custom vocabulary, and language detection. Real-time streaming is solid: it is competitive with Deepgram on accuracy, with slightly higher latency in controlled tests.
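To make the diarization and timestamp output concrete, here is a minimal sketch that turns diarized utterances into a readable script. It assumes each utterance arrives as a mapping with "speaker", "text", and "start" (in milliseconds) keys; that shape and the sample data are illustrative assumptions, so check the actual response format before relying on it.

```python
from typing import Iterable, Mapping


def format_ms(ms: int) -> str:
    """Render a millisecond offset as MM:SS."""
    total_seconds = ms // 1000
    return f"{total_seconds // 60:02d}:{total_seconds % 60:02d}"


def render_dialogue(utterances: Iterable[Mapping]) -> str:
    """Turn diarized utterances into a timestamped, speaker-labeled script.

    Assumption: each utterance has "speaker", "text", and "start" (ms) keys.
    """
    lines = []
    for utt in utterances:
        stamp = format_ms(utt["start"])
        lines.append(f"[{stamp}] Speaker {utt['speaker']}: {utt['text']}")
    return "\n".join(lines)


# Hypothetical sample data, shaped like a diarized transcript response.
sample = [
    {"speaker": "A", "text": "Thanks for joining.", "start": 0},
    {"speaker": "B", "text": "Happy to be here.", "start": 61500},
]
print(render_dialogue(sample))
```

Keeping the formatter independent of any SDK object makes it easy to unit test against recorded responses.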

Audio intelligence features are the layer that sets AssemblyAI apart from pure transcription services:

Summarization generates a condensed summary of the full audio content. You get a coherent summary paragraph from a 90-minute meeting without writing a single line of summary logic.

Auto chapters breaks long audio into logical sections with titles, useful for podcast indexing, lecture capture, and any content where navigation matters.

Key phrases extracts the most important phrases from the transcript ranked by relevance.

Sentiment analysis returns per-sentence sentiment scores across the transcript.

Entity detection identifies people, organizations, locations, products, and other named entities in the audio.

Content moderation flags audio for hate speech, profanity, violence, and other content categories.

PII detection and redaction automatically identifies and optionally removes personally identifiable information from transcripts.

Speaker diarization separates speakers in multi-person recordings with consistent speaker labels across the transcript.
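The features above are typically switched on per request. The sketch below builds a JSON body for an async transcription request with several of them enabled. The field names are assumptions based on the REST API's parameters as I understand them, and the function name is hypothetical, so verify both against the current API reference. It also treats summarization and auto chapters as alternatives, on the assumption that only one of them can be enabled per transcript.

```python
import json


def build_transcript_request(audio_url: str, *, long_form: bool = False) -> dict:
    """Build a request body enabling several audio intelligence features.

    Field names are assumptions; check them against the API reference.
    """
    body = {
        "audio_url": audio_url,
        "speaker_labels": True,      # speaker diarization
        "auto_highlights": True,     # key phrases
        "sentiment_analysis": True,  # per-sentence sentiment
        "entity_detection": True,    # named entities
        "content_safety": True,      # content moderation
    }
    # Assumption: summarization and auto chapters are mutually exclusive,
    # so pick one based on how the audio will be consumed.
    if long_form:
        body["auto_chapters"] = True
    else:
        body["summarization"] = True
    return body


# Hypothetical URL, used only to show the resulting payload.
print(json.dumps(build_transcript_request("https://example.com/call.mp3"), indent=2))
```

Each enabled feature adds to the response size and, depending on the plan, to the cost, so it is worth enabling only what the downstream code reads.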

LeMUR in depth

LeMUR is the most compelling reason to choose AssemblyAI over alternatives for audio intelligence use cases.

The standard audio intelligence features are automatic and rule-based. Summarization generates a summary. Sentiment analysis scores sentiment. They work well for their specific tasks and they're cheap to use at scale.

LeMUR is different. You take a transcribed audio file and you ask it anything. "What were the three main pain points the customer mentioned?" "Summarize the action items assigned to each person." "Did the sales rep follow the compliance script for disclosures?" "What was the customer's emotional trajectory through this call?"

These aren't questions that keyword extraction or rule-based entity detection can answer reliably. They require understanding the content and context of the conversation, and that's what a language model is good at. LeMUR provides that capability against your audio content with a single API call.

The practical applications are substantial:

Meeting analysis: Transcribe a recorded meeting, run LeMUR to extract action items with owners and deadlines, push those to your project management system. This is a real workflow that teams are running in production using AssemblyAI and an automation layer.

Call center quality analysis: Transcribe customer service calls, use LeMUR to evaluate whether the agent followed the script, handled objections appropriately, and provided accurate information. At scale, this replaces manual call sampling with systematic coverage.

Podcast and content indexing: Transcribe long-form audio, use LeMUR to generate show notes, episode summaries, and discussion topic tags for search indexing.

Compliance monitoring: In regulated industries, use LeMUR to flag calls where specific disclosures weren't made or where prohibited advice was given.
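For workflows like the meeting analysis case, most of the engineering is in the prompt and in parsing the reply defensively. This is a minimal sketch of that part only: the helper names and the JSON shape are hypothetical, and the prompt string would be passed to whatever LeMUR task call your SDK version exposes.

```python
import json


def build_action_item_prompt(attendees: list) -> str:
    """Build a prompt that asks for action items as a JSON array."""
    names = ", ".join(attendees)
    return (
        "List the action items from this meeting. "
        f"Attendees: {names}. "
        "Respond with only a JSON array of objects with the keys "
        '"owner", "task", and "deadline" (use null when no deadline was stated).'
    )


def parse_action_items(response_text: str) -> list:
    """Parse the model's reply, tolerating extra text around the JSON array."""
    start = response_text.find("[")
    end = response_text.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        items = json.loads(response_text[start : end + 1])
    except json.JSONDecodeError:
        return []
    # Keep only well-formed entries that actually name a task.
    return [item for item in items if isinstance(item, dict) and item.get("task")]


# Hypothetical model reply, used to exercise the parser.
reply = 'Here you go:\n[{"owner": "Dana", "task": "Send pricing", "deadline": null}]'
for item in parse_action_items(reply):
    print(item["owner"], "->", item["task"])
```

Returning an empty list on malformed output lets the calling automation decide whether to retry or flag the meeting for manual review.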

LeMUR is priced per token, and token costs accumulate on long transcripts with complex analysis. A one-hour meeting transcript is roughly 8,000-12,000 words, which at typical LLM tokenization rates is 10,000-15,000 input tokens. The cost is real but typically justified for the analysis use cases where manual processing would cost much more.
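The estimate above can be turned into a rough budgeting helper. This sketch assumes 1.25 tokens per word (the ratio implied by the word and token ranges quoted here), assumes the full transcript is sent as input on every query, and takes the per-token price as a parameter because no rate is given in this review. The price used in the example is a placeholder, not a published rate.

```python
TOKENS_PER_WORD = 1.25  # assumption: ratio implied by the ranges quoted above


def estimate_input_tokens(word_count: int, tokens_per_word: float = TOKENS_PER_WORD) -> int:
    """Approximate input tokens for a transcript of the given length."""
    return round(word_count * tokens_per_word)


def estimate_query_cost(word_count: int, price_per_1k_input_tokens: float, queries: int = 1) -> float:
    """Approximate input-side cost of running `queries` prompts on one transcript.

    Assumption: the full transcript is sent as input for each query.
    Output tokens are ignored, so treat the result as a lower bound.
    """
    tokens = estimate_input_tokens(word_count)
    return tokens * queries * price_per_1k_input_tokens / 1000


# Placeholder price for illustration only; look up the current rate.
HYPOTHETICAL_PRICE_PER_1K = 0.003
print(estimate_input_tokens(8_000), estimate_input_tokens(12_000))
print(estimate_query_cost(10_000, HYPOTHETICAL_PRICE_PER_1K, queries=5))
```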

The developer experience

AssemblyAI has the broadest SDK support in the speech-to-text category. Python, JavaScript, Java, Ruby, Go, and C# are all officially maintained. For teams working in Java or Ruby, which are underserved by many AI API providers, this matters practically.

A basic async transcription in Python:

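import assemblyai as aai

# Load the key from configuration in real code; it is hard-coded here for brevity.
aai.settings.api_key = "your-api-key"
transcriber = aai.Transcriber()

# transcribe() submits the file and blocks until the transcript is ready.
transcript = transcriber.transcribe("https://your-audio-file.com/audio.mp3")
if transcript.status == aai.TranscriptStatus.error:
    raise RuntimeError(transcript.error)
print(transcript.text)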

For LeMUR analysis on a completed transcript:

# Reuses the `transcript` object from the previous example.
result = transcript.lemur.task(
    "List the action items from this meeting with the name of who is responsible for each item.",
    final_model=aai.LemurModel.claude3_5_sonnet
)
print(result.response)

The SDK design is clean and the async pattern is consistent. Real-time streaming uses the same SDK with a streaming-specific interface.

Documentation quality is good for core use cases. The worked examples for LeMUR could be more extensive, particularly for complex multi-step analysis workflows, but the basics are well-covered.

PII and compliance

The PII detection and redaction feature deserves more attention than it usually gets in discussions of AssemblyAI.

For applications processing audio that contains personal information, the choices are typically: don't send it to a cloud API at all (which means self-hosting), implement your own redaction after transcription, or use a service with built-in redaction. AssemblyAI's PII redaction built into the transcription pipeline removes the need for a post-processing redaction step.

The redaction covers names, phone numbers, email addresses, social security numbers, medical record numbers, financial account numbers, and other common PII categories. You can configure which categories to detect and whether to redact them from the transcript or just flag their location.
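Category selection usually comes down to a few request options. The sketch below builds those options as a plain dictionary. The field names ("redact_pii", "redact_pii_policies", "redact_pii_sub") and the policy identifiers are assumptions based on my reading of the REST API, and the helper itself is hypothetical, so confirm the exact names against the API reference before use.

```python
# Assumed substitution modes: replace with the entity label, or with hashes.
ALLOWED_SUBSTITUTIONS = {"entity_name", "hash"}


def build_redaction_options(policies, substitution: str = "entity_name") -> dict:
    """Build the PII redaction portion of a transcription request.

    Field and policy names are assumptions; verify against the API reference.
    """
    policies = list(policies)
    if not policies:
        raise ValueError("at least one PII policy is required")
    if substitution not in ALLOWED_SUBSTITUTIONS:
        raise ValueError(f"unsupported substitution: {substitution!r}")
    return {
        "redact_pii": True,
        "redact_pii_policies": policies,
        "redact_pii_sub": substitution,
    }


# Assumed policy identifiers for a few of the categories listed above.
options = build_redaction_options(
    ["person_name", "phone_number", "email_address", "us_social_security_number"]
)
print(options)
```

These options would be merged into the same request body that carries the audio URL, so redaction happens before the transcript is ever returned.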

For medical applications, AssemblyAI offers a Business Associate Agreement, making it viable for HIPAA-regulated workloads. For legal and financial applications with similar data handling requirements, the built-in compliance features reduce the infrastructure and legal overhead compared to building your own redaction pipeline on top of a provider that doesn't offer it.

Where AssemblyAI fits vs competitors

Against Deepgram: Deepgram has lower latency for real-time streaming and a stronger voice agent story with their Voice Agent API and Aura TTS. AssemblyAI has more audio intelligence features and LeMUR. For pure voice agent pipelines where response time is critical, Deepgram. For applications where the value is in analyzing audio content after the fact, AssemblyAI.

Against Descript: Descript is a media production application, not an API service. They share the transcription capability but serve different use cases entirely. Descript for audio/video editing with transcript-based workflows. AssemblyAI for programmatic audio processing at scale.

Against ElevenLabs: ElevenLabs is primarily TTS with a voice agent platform. AssemblyAI is STT with audio intelligence. The overlap is minimal except in the voice agent space where both offer different ends of the pipeline.

Pricing at scale

For a production application processing significant audio volume, the math looks like this:

100 hours per day of Universal-1 transcription: $27/day, or $810 over a 30-day month.

The same volume with Universal-2: $165/day, or $4,950 over a 30-day month.

LeMUR analysis on 10% of transcripts at average complexity: variable, roughly $100-500/month depending on query complexity and transcript length.
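The transcription figures above follow directly from the list prices, and a small calculator makes it easy to rerun them for your own volume. This sketch uses only the Pay-As-You-Go rates quoted in this review and assumes a 30-day month. It uses Decimal so the currency arithmetic stays exact.

```python
from decimal import Decimal

# Pay-As-You-Go rates quoted in this review, normalized to dollars per hour.
RATES_PER_HOUR = {
    "universal-1": Decimal("0.0045") * 60,  # quoted per minute
    "universal-2": Decimal("1.65"),         # quoted per hour
}


def daily_cost(model: str, hours_per_day: int) -> Decimal:
    """Transcription cost for one day at the given audio volume."""
    return RATES_PER_HOUR[model] * hours_per_day


def monthly_cost(model: str, hours_per_day: int, days: int = 30) -> Decimal:
    """Transcription cost for a month of `days` days (30 by assumption)."""
    return daily_cost(model, hours_per_day) * days


for model in RATES_PER_HOUR:
    print(
        f"{model}: ${daily_cost(model, 100):,.2f}/day, "
        f"${monthly_cost(model, 100):,.2f}/month"
    )
```

Enterprise volume discounts are not modeled here, so treat the output as an upper bound for large workloads.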

For call centers, the Universal-1 numbers are often compelling compared to manual quality assurance costs. For research applications requiring maximum accuracy, Universal-2 pricing is steep but often justified by the downstream use of the transcript.

Volume pricing applies at enterprise tiers. If you're processing thousands of hours monthly, the Pay-As-You-Go rates are not the rates you'll actually pay.

Who should use AssemblyAI

Teams building audio intelligence applications where understanding transcript content matters beyond simple keyword matching. LeMUR is the reason to choose AssemblyAI for this category.

Compliance-sensitive applications in healthcare, legal, and financial services that need PII redaction, content moderation, and the option for a compliance framework (HIPAA BAA).

Call center and customer success analytics at companies that want to analyze more calls than human QA teams can manually review.

Developers in Java, Ruby, or Go who benefit from first-party SDK support that Deepgram doesn't offer.

Podcast and media companies building indexing, search, and discovery features on top of audio content where transcription plus intelligent analysis is the core capability.

What AssemblyAI doesn't cover

AssemblyAI has no TTS product. If you're building a voice agent that needs to speak as well as listen, you're combining AssemblyAI with a separate synthesis service. For the synthesis side of a complete voice pipeline, the usual options are ElevenLabs for TTS quality or Deepgram for a low-latency integrated pipeline.

The language support outside English is improving but is a relative weakness. Universal-2 is English-focused. Universal-1 supports multiple languages but with lower accuracy on non-English audio than on English. For applications with significant multilingual audio volume, evaluate the specific language performance carefully before committing.

The bottom line

AssemblyAI is the transcription API to reach for when you need to do something with the transcript beyond having it as text. LeMUR turns recorded audio into addressable content in a way that no other API makes as easy. Universal-2 accuracy is among the best available for English. The compliance features handle requirements that would otherwise require significant custom work. The pricing is competitive at Universal-1 and expensive at Universal-2, which is the right structure for the capability tiers. For audio intelligence use cases, it's the default evaluation starting point in 2026.

Key features

Universal-2 model for highest-accuracy English transcription with speaker diarization
Universal-1 for production transcription balancing accuracy and cost
LeMUR for LLM-powered analysis on audio transcripts: summarization, Q&A, and custom analysis
Real-time streaming transcription for live audio applications
Speaker diarization to separate multiple speakers in a recording
PII detection and redaction for compliance use cases
Auto chapters, highlights, and key phrase extraction
Content moderation for audio identifying sensitive topics
Entity detection and sentiment analysis on transcript text

Pros and cons

Pros

+ LeMUR is a genuinely differentiated feature for extracting intelligence from audio at scale
+ Universal-2 accuracy on English is competitive with the best available models
+ PII detection and redaction built in for compliance-sensitive applications
+ Broadest SDK language support in the category: Python, JavaScript, Java, Ruby, Go, C#
+ Auto chapters, highlights, and summarization work well out of the box
+ Real-time streaming quality is strong, competitive with Deepgram on most audio types
+ Content moderation for audio is a rare built-in capability

Cons

− Universal-2 at $1.65/hour is significantly more expensive than baseline transcription options
− LeMUR token costs add up quickly on long audio files with complex analysis
− Language support outside English, though improving, is less thorough than some competitors
− No TTS product, so voice agent deployments require a separate synthesis service
− Documentation is good but some advanced features lack worked examples
− Real-time latency is slightly higher than Deepgram on equivalent audio in controlled tests

Who is AssemblyAI for?

Meeting and call recording with automatic summarization and action item extraction via LeMUR
Call center quality analysis and compliance monitoring at scale
Podcast and long-form audio content analysis and indexing
Medical and legal transcription with PII redaction and speaker identification

Alternatives to AssemblyAI

If AssemblyAI isn't quite the right fit, the closest alternatives are Deepgram, Descript, and ElevenLabs. See our full AssemblyAI alternatives page for side-by-side comparisons.

Frequently Asked Questions

What is AssemblyAI?

AssemblyAI is a speech-to-text API and audio intelligence platform. The core product is transcription, converting audio to text with speaker identification, timestamps, and language detection. On top of that, they offer LeMUR, an LLM-powered layer that lets you run natural language analysis against your transcripts: ask questions, generate summaries, extract specific information, or apply custom analysis logic. They support async transcription for uploaded files and real-time streaming for live audio. The platform is built as an API with SDKs for six programming languages.

How much does AssemblyAI cost?

AssemblyAI pricing is per minute of audio. Universal-1 is $0.0045 per minute ($0.27/hour), which is the main production tier for most applications. Universal-2, the highest-accuracy model, is $1.65 per hour ($0.0275/min). Best, optimized for throughput applications, is $0.0036 per minute. Nano, for simple speech recognition, is $0.0005 per minute. LeMUR usage is priced per token separately. New accounts get a free tier with initial credits. There's no monthly minimum on Pay-As-You-Go.

What is LeMUR and when should I use it?

LeMUR is AssemblyAI's LLM-powered analysis layer that works on audio transcripts. You transcribe an audio file, then submit a natural language query against that transcript. Use cases include: summarizing a one-hour meeting into bullet points, answering "what were the three main concerns raised by the customer" on a sales call transcript, extracting specific data points from medical consultation recordings, or flagging compliance issues in recorded customer service calls. LeMUR is priced per token and works best for analysis tasks where the structure of the audio content is complex enough that simple keyword matching or basic entity extraction doesn't capture what you need.

How does AssemblyAI compare to Deepgram?

Both are speech-to-text API platforms with real-time capabilities, but they're optimized for different things. Deepgram prioritizes low latency for real-time voice applications and has the Voice Agent API for end-to-end voice pipeline deployment. AssemblyAI prioritizes audio intelligence features: LeMUR, richer content analysis, PII redaction, and content moderation. On raw transcription accuracy for English, Universal-2 is among the best available. On real-time latency, Deepgram has a measurable edge for the lowest-latency use cases. If you're building a voice agent where response speed is critical, Deepgram. If you're analyzing recorded audio for business intelligence or compliance, AssemblyAI's feature set is more relevant.

Does AssemblyAI work for medical or legal transcription?

Yes, with caveats. The PII detection and redaction feature covers common identifiers like names, phone numbers, social security numbers, and medical record numbers. Speaker diarization helps separate parties in a multi-person recording. Universal-2 produces accuracy suitable for professional use cases on clear audio. For medical applications that need HIPAA-compliant infrastructure, AssemblyAI offers a Business Associate Agreement (BAA), which is required to process protected health information. Legal transcription requirements vary by jurisdiction, and most legal professionals use AI-generated transcripts as a starting point that receives human review rather than as final documents.
