text-to-speech, open-source. Status: deprecated

Coqui TTS

Open-source text-to-speech toolkit descended from Mozilla TTS, community-maintained after company shutdown

Coqui TTS is the leading open-source text-to-speech toolkit, descended from Mozilla TTS and maintained by an active community after Coqui the company shut down in early 2024. The library includes 30+ pre-trained models and supports voice cloning via XTTS v2. It's entirely self-hosted, which means no API fees but also no managed infrastructure. For developers who need TTS they can run on their own hardware without recurring costs or data privacy concerns, it remains the default open-source choice.

Coqui TTS is one of those projects where the company and the software have almost opposite trajectories. The company Coqui shut down quietly in early 2024. The GitHub repository for its TTS library, which the team had been building since 2020 on top of Mozilla's TTS research, kept getting stars, pull requests, and new model contributions. As of mid-2026, it's still the most widely referenced open-source TTS project in the Python ecosystem, even though the organization that built it is defunct.

This review covers what Coqui TTS actually is in its current community-maintained state, what it can do, and when it makes sense over commercial alternatives.

The history matters here

Understanding Coqui TTS requires understanding where it came from. Mozilla ran a speech research project called Mozilla TTS starting around 2018, which produced a series of open-source TTS models and a training toolkit. When Mozilla wound down its in-house speech engineering work, several of the engineers spun out to found Coqui in 2020 in Berlin, taking the lineage of Mozilla TTS with them and building a commercial product on top.

Coqui built and released a substantial open-source library alongside their commercial ambitions, including pre-trained models and training code. When the company shut down in early 2024, the commercial product disappeared but the open-source repository remained, and the community that had formed around it kept working.

The result is a library with genuine depth, real production use, and no corporate entity behind it. That context matters for any evaluation because it determines what you can rely on the library for.

What the library actually provides

The Coqui TTS Python library is a toolkit, not a single model. You install it, and you get access to a collection of different architectures and pre-trained checkpoints. The most important ones to understand are:

XTTS v2 is the current flagship model for most practical use cases. It supports voice cloning from a reference audio sample and covers 17 languages. The quality is meaningfully better than earlier generations of Coqui models, and the multilingual cloning, where you clone a voice in one language and generate speech in another, is a capability that's genuinely useful for localization work without recorded multilingual talent.

YourTTS is an earlier zero-shot voice cloning model that still sees use in some deployments, particularly where XTTS v2's resource requirements are too high.

Bark integration is available, though Bark is technically a separate Suno project that Coqui provides a wrapper for. Bark has good expressiveness but is slow and unpredictable in a way that makes it poorly suited for anything with latency requirements.

Tacotron2 and FastSpeech2 are earlier architectures that trade quality for speed. They're relevant if you're running on CPU or low-memory environments and need something that finishes in reasonable time.

The library also provides training code for fine-tuning models on custom voice data, which is significant if you want to train a speaker-specific model rather than relying on zero-shot cloning.
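The cross-language cloning described for XTTS v2 can be sketched as a small wrapper around the same `tts_to_file` call used in the getting-started example. This is a minimal sketch, not the library's own API: `clone_into_language` is a hypothetical helper name, and the language set is the 17 codes listed on the XTTS v2 model card as I understand it, so verify it against the version you install.

```python
# Language codes XTTS v2 accepts (17 total). Assumption: taken from the
# XTTS v2 model card; check against your installed version.
XTTS_V2_LANGUAGES = frozenset({
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
})


def clone_into_language(tts, text, speaker_wav, language, file_path):
    """Speak `text` in `language` using the voice heard in `speaker_wav`.

    `tts` is expected to be a loaded XTTS v2 instance, for example
    TTS("tts_models/multilingual/multi-dataset/xtts_v2").
    The reference clip does not need to be in the target language.
    """
    if language not in XTTS_V2_LANGUAGES:
        raise ValueError(f"XTTS v2 does not support language code {language!r}")
    tts.tts_to_file(
        text=text,
        speaker_wav=speaker_wav,
        language=language,
        file_path=file_path,
    )
    return file_path
```

With an English reference recording, calling `clone_into_language(tts, "Guten Tag", "reference_voice.wav", "de", "german.wav")` would produce German speech in the reference speaker's voice. Failing fast on an unsupported code is a design choice for batch localization jobs, where a typo in one language code shouldn't surface only after the model has loaded.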

Quality: an honest placement

Coqui TTS output quality, specifically XTTS v2 output quality, is noticeably below current commercial offerings. If you generate the same text with XTTS v2 and with ElevenLabs, and you listen to both, the ElevenLabs output will sound more natural on most voice types and content styles. The gap is real.

This doesn't mean Coqui is unusable for production. The quality is good enough for many applications: internal tooling, accessibility features where any synthesized voice beats silence, e-learning content where a slightly robotic voice is acceptable, applications where the text being read is very short.

Where it clearly falls short is customer-facing audio where naturalness affects user trust or engagement, long-form content where the listener accumulates subtle quality signals over time, and emotional delivery where prosody matters. For those use cases, ElevenLabs or Play.ht are the practical answer even if the per-character cost is real.

The voice cloning quality from XTTS v2 is similarly positioned: it works, it retains speaker identity from a reference clip, and it's substantially below what Professional Voice Cloning on ElevenLabs produces from the same source material. For prototyping and internal tools, fine. For a production voice clone you're putting in front of customers, the commercial options produce better results.

The self-hosting reality

Running Coqui TTS in production requires infrastructure you own and maintain. That means:

A GPU instance somewhere, either cloud or on-premises. XTTS v2 on CPU is too slow for real-time applications. A g4dn.xlarge on AWS or a comparable GPU instance on other clouds runs XTTS v2 acceptably. At roughly $0.50-2.00 of compute per hour of generated audio depending on instance type and model, the economics at high volume compare favorably to commercial per-character pricing.

An inference service wrapping the library. The Python library gives you a programmatic interface, but you need to build the HTTP service, handle concurrency, manage GPU memory, and deal with model loading times if you want something that scales like an API.

Model and dependency management. The library's dependencies have occasionally had conflicts between versions, and the community doesn't always move quickly on compatibility issues. Budget engineering time for setup and periodic maintenance.

This is not a complaint about Coqui specifically. It's the honest description of running any open-source ML library in production. If your team has the infrastructure engineering capacity to handle this, the cost and control advantages are real. If you don't, a managed API is the right choice regardless of the price difference.
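To make the inference-service point concrete, here is a minimal sketch using only the Python standard library. It loads the model once on first request and puts a lock around synthesis so concurrent requests don't compete for GPU memory. Everything here is illustrative: `LazySynthesizer`, `load_xtts`, `make_handler`, and the JSON request fields are hypothetical names, the port is arbitrary, and accepting file paths straight from the request body is done for brevity only.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class LazySynthesizer:
    """Loads the model once on first use and serializes access to it."""

    def __init__(self, load_model):
        self._load_model = load_model  # callable returning a loaded model
        self._model = None
        self._lock = threading.Lock()  # one synthesis at a time

    def synthesize(self, **kwargs):
        with self._lock:
            if self._model is None:
                self._model = self._load_model()  # slow, happens once
            return self._model.tts_to_file(**kwargs)


def load_xtts():
    """Assumed loader; imported lazily so this module loads without TTS."""
    from TTS.api import TTS
    return TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")


def make_handler(synth):
    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            try:
                req = json.loads(self.rfile.read(length))
                synth.synthesize(
                    text=req["text"],
                    speaker_wav=req["speaker_wav"],
                    language=req.get("language", "en"),
                    file_path=req["file_path"],
                )
                status, body = 200, {"file_path": req["file_path"]}
            except (KeyError, ValueError) as exc:  # bad JSON or missing field
                status, body = 400, {"error": str(exc)}
            payload = json.dumps(body).encode()
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return Handler


def serve(port=8080):
    synth = LazySynthesizer(load_xtts)
    ThreadingHTTPServer(("127.0.0.1", port), make_handler(synth)).serve_forever()
```

A single lock is the simplest way to keep one GPU from being oversubscribed, but it also caps throughput at one request at a time. A production service would more likely use a job queue, batching, and several worker processes, and that is where most of the engineering time goes.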

The data privacy case

This is where Coqui's self-hosted model has a genuine advantage that a hosted API can't match. If your application processes text or reference audio that can't leave your infrastructure, due to regulatory requirements, client data agreements, or security policy, the choice is self-hosting or not having the capability. Coqui TTS is the most capable open-source option for that constraint.

Healthcare applications processing patient voice data, legal applications handling client communications, enterprise deployments under data residency requirements, and government applications with restricted data handling policies all potentially fall into this category. For those teams, the quality trade-off against commercial APIs might be acceptable because the alternative isn't ElevenLabs with better privacy controls. The alternative is no TTS capability at all.

Post-shutdown community state

As of mid-2026, the coqui-ai/TTS repository has accumulated community contributors who are actively merging fixes and improvements. XTTS v2 still receives patches. Issues get addressed, though on a volunteer timeline rather than a commercial support timeline.

The project does not have a commercial entity offering paid support contracts. If you deploy Coqui TTS in production and run into a model bug or a critical security issue in a dependency, you're either fixing it yourself, waiting for the community, or paying a consultant to do it for you. That's the honest maintenance picture.

For research teams, startups prototyping before they have revenue to pay API costs, and organizations with data constraints, this is manageable. For teams that need guaranteed response times on production issues, the lack of commercial backing is a real risk.

When Coqui TTS makes sense

Self-hosted requirements: If your data can't go to a third-party API, Coqui TTS is the highest-quality option that runs entirely on your infrastructure.

High-volume batch processing at low cost: At sufficient scale, running your own GPU cluster for TTS generation is cheaper than per-character commercial pricing. The break-even point depends on your volume and quality requirements, but for teams generating hundreds of hours of audio monthly, the math often favors self-hosting.

Research and experimentation: If you're building something novel in the TTS space, having access to model weights, training code, and a toolkit you can modify is valuable. Commercial APIs don't give you that.

Offline and edge deployment: If you need TTS on devices without internet connectivity, Coqui TTS models can be packaged and deployed locally. Commercial APIs can't.
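The break-even reasoning for high-volume batch processing can be worked through as simple arithmetic. The sketch below is a back-of-envelope model, not a pricing reference: the only figure taken from this review is the $0.50-2.00 of compute per hour of generated audio. The commercial price per 1,000 characters, the characters-per-audio-hour estimate, and the fixed monthly engineering cost are all hypothetical placeholders to replace with your own numbers.

```python
def commercial_cost_per_audio_hour(price_per_1k_chars, chars_per_audio_hour):
    """Convert per-character API pricing into cost per hour of generated audio."""
    return price_per_1k_chars * chars_per_audio_hour / 1000


def break_even_hours(fixed_monthly_cost, self_host_per_hour, commercial_per_hour):
    """Monthly audio hours above which self-hosting is cheaper.

    Returns None when self-hosting never pays off, i.e. when its
    per-hour compute cost is not below the commercial per-hour cost.
    """
    saving_per_hour = commercial_per_hour - self_host_per_hour
    if saving_per_hour <= 0:
        return None
    return fixed_monthly_cost / saving_per_hour


# Hypothetical inputs, for illustration only.
CHARS_PER_AUDIO_HOUR = 54_000   # assumes ~150 words/min at ~6 chars per word
API_PRICE_PER_1K_CHARS = 0.15   # placeholder, not a quoted vendor price
FIXED_MONTHLY_COST = 3_000      # engineering and ops time, placeholder
SELF_HOST_PER_HOUR = 2.00       # upper end of the range cited in this review

api_per_hour = commercial_cost_per_audio_hour(API_PRICE_PER_1K_CHARS, CHARS_PER_AUDIO_HOUR)
hours = break_even_hours(FIXED_MONTHLY_COST, SELF_HOST_PER_HOUR, api_per_hour)
```

With these placeholder inputs the break-even lands at roughly 490 hours of audio per month, which is consistent with the "hundreds of hours" rule of thumb above. The result is very sensitive to the fixed cost you assign to engineering time, which is usually the number teams underestimate.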

When it doesn't make sense

If you're building a user-facing product where voice quality affects engagement, Coqui TTS is not where you should start. The quality gap against ElevenLabs is large enough that it affects user experience in measurable ways. Starting with a managed API and evaluating whether the cost is sustainable at your scale is the more sensible path than starting with open-source and accepting a quality penalty from day one.

If you don't have GPU infrastructure or the engineering capacity to build and maintain an inference service, the apparent cost advantage evaporates quickly when you account for infrastructure and engineering time.

Getting started

Installation is pip-based:

pip install TTS

For XTTS v2 voice cloning, the basic pattern is:

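from TTS.api import TTS

# Load XTTS v2 by its model ID (weights are downloaded on first use)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the voice in reference_voice.wav and write the speech to output.wav
tts.tts_to_file(
    text="Your text here",
    speaker_wav="reference_voice.wav",  # short, clean clip of the target voice
    language="en",                      # language of the generated speech
    file_path="output.wav",
)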

The first run downloads the model weights, which are substantial. XTTS v2 is around 1.8GB. Plan for that in your deployment pipeline.

The documentation at the GitHub repository covers the available models, API reference, and training pipeline. The community Discord (linked from the repo) is active for questions.

The bottom line

Coqui TTS is the right choice for a specific set of situations: self-hosted requirements, high-volume cost optimization, research, and offline deployment. For everything else, especially customer-facing voice applications where quality matters, the commercial alternatives produce better results with less operational overhead. The company shutdown doesn't change the utility of the library for the use cases it fits, but it's relevant context for any team making a long-term infrastructure decision.

Key features

30+ pre-trained TTS models including VITS, YourTTS, Bark, and XTTS for multi-speaker synthesis
Voice cloning from a short reference audio sample using XTTS v2
Multi-lingual support across 17 languages in the XTTS v2 model
Speaker similarity fine-tuning for custom voice adaptation
Python API and command-line interface for integration and batch synthesis
Active community development and model contributions post-company-shutdown
Streaming synthesis for real-time applications via the Python library

Pros and cons

Pros

+ Completely free with no usage limits beyond your own compute costs
+ XTTS v2 supports voice cloning across 17 languages from a short reference sample
+ Active community continues model development and bug fixes post-company-shutdown
+ Full control over data, model weights, and inference pipeline
+ Large model zoo covering many architectures for research and production use
+ Well-documented Python API with clean integration patterns

Cons

− No hosted service, managed API, or commercial support
− Requires meaningful engineering effort to deploy and maintain in production
− Voice quality on most models trails current commercial offerings like ElevenLabs
− GPU infrastructure is required for reasonable inference speed on quality models
− Company is defunct, so no product roadmap, no security patches from a commercial entity
− Community maintenance means issues and feature requests have unpredictable resolution timelines

Who is Coqui TTS for?

Self-hosted TTS for applications with strict data privacy requirements
Research and prototyping where API costs at scale are prohibitive
Offline TTS deployment on edge devices or air-gapped environments
Custom voice synthesis pipelines where full control over the model is required

Alternatives to Coqui TTS

If Coqui TTS isn't quite the right fit, the closest alternatives are ElevenLabs and Play.ht. See our full Coqui TTS alternatives page for side-by-side comparisons.

Frequently Asked Questions

What is Coqui TTS?

Coqui TTS is an open-source text-to-speech library written in Python. It provides a collection of pre-trained TTS models, including XTTS v2, which supports voice cloning across 17 languages. The library descends from Mozilla TTS and was originally developed by Coqui, a Berlin-based company that shut down in early 2024. The library code is released under the Mozilla Public License 2.0 and is actively maintained by its community. Individual model checkpoints can carry their own license terms, so check them before commercial use. It remains the most widely used open-source TTS toolkit.

Is Coqui TTS still maintained after the company shut down?

Yes. The company Coqui shut down in early 2024, but the GitHub repository at coqui-ai/TTS continues to receive community contributions. Pull requests are reviewed and merged, bug reports get addressed, and new models are contributed by the community. The pace of development is slower than when the company had a paid team working on it, but the library is not abandoned. For production use you should expect to own more of the maintenance burden than you would with a commercial product.

How does Coqui TTS voice cloning work?

Voice cloning in Coqui TTS uses the XTTS v2 model. You provide a short reference audio clip of the target voice, typically 6-30 seconds of clean speech, and the model conditions on that reference to generate new speech in that voice. Quality depends on the cleanliness and length of the reference audio. XTTS v2 supports 17 languages, meaning you can clone a voice in one language and generate speech in a different language using that cloned voice.
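Since the reference clip length matters, it can be worth checking it before handing the file to the model. The following sketch uses Python's standard `wave` module to measure an uncompressed WAV file against the 6-30 second guideline above. The function names are hypothetical, and the check only covers duration: it can't tell you whether the recording is clean.

```python
import wave


def reference_clip_seconds(path):
    """Duration in seconds of an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_clip_problems(path, min_seconds=6.0, max_seconds=30.0):
    """Return a list of warnings; an empty list means the length looks usable."""
    seconds = reference_clip_seconds(path)
    problems = []
    if seconds < min_seconds:
        problems.append(f"clip is {seconds:.1f}s, shorter than {min_seconds:.0f}s")
    if seconds > max_seconds:
        problems.append(f"clip is {seconds:.1f}s, longer than {max_seconds:.0f}s")
    return problems
```

Running `reference_clip_problems("reference_voice.wav")` before synthesis gives you a cheap guard in an upload flow. Compressed formats such as MP3 would need to be converted to WAV first, since the `wave` module only reads PCM WAV files.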

How does Coqui TTS compare to ElevenLabs?

ElevenLabs is a managed API service with higher output quality on most voice types, especially for naturalness on conversational content. Coqui TTS is a self-hosted library that's free to run but requires your own infrastructure and produces lower quality output than current commercial models. The practical trade-off is cost versus quality and operational complexity. At high volume, Coqui's compute costs can be significantly lower than ElevenLabs' per-character pricing, but you're buying that savings with engineering time and infrastructure management.

What hardware do I need to run Coqui TTS?

XTTS v2 and other quality models run on CPU but are slow enough that CPU inference isn't practical for real-time applications. A modern GPU with at least 4GB VRAM is the minimum for reasonable throughput on quality models. For production workloads generating significant audio volume, a dedicated GPU instance on any cloud provider is the standard setup. Smaller models like Tacotron2 with a Griffin-Lim vocoder can run on CPU acceptably for batch jobs where latency isn't the priority.
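One way to act on this at startup is to check for a suitable GPU and fall back to a lighter model when none is present. The sketch below assumes PyTorch is what provides GPU access, which is the case for Coqui TTS. `pick_device` and the model mapping are hypothetical. The 4 GB threshold comes from the guidance above, and the Tacotron2 model ID is one example of a lighter English model from the library's catalog, not a recommendation.

```python
def pick_device(min_vram_gb=4.0):
    """Return "cuda" if a GPU with enough memory is visible, else "cpu"."""
    try:
        import torch  # optional here so the check degrades gracefully
    except ImportError:
        return "cpu"
    if not torch.cuda.is_available():
        return "cpu"
    # Decimal gigabytes, so a card sold as "4 GB" passes a 4.0 threshold.
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    return "cuda" if total_gb >= min_vram_gb else "cpu"


# Hypothetical mapping: flagship model on GPU, lighter model on CPU.
MODEL_FOR_DEVICE = {
    "cuda": "tts_models/multilingual/multi-dataset/xtts_v2",
    "cpu": "tts_models/en/ljspeech/tacotron2-DDC",
}

device = pick_device()
model_name = MODEL_FOR_DEVICE[device]
```

From there, `TTS(model_name).to(device)` loads the chosen model on the chosen device. Note that the CPU fallback in this sketch is English-only and doesn't do voice cloning, so it only makes sense where a generic voice is acceptable.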
