Coqui TTS
Open-source text-to-speech toolkit descended from Mozilla TTS, community-maintained after company shutdown
Coqui TTS is the leading open-source text-to-speech toolkit, descended from Mozilla TTS and maintained by an active community after Coqui the company shut down in early 2024. The library includes 30+ pre-trained models and supports voice cloning via XTTS v2. It's entirely self-hosted, which means no API fees but also no managed infrastructure. For developers who need TTS they can run on their own hardware without recurring costs or data privacy concerns, it remains the default open-source choice.
Coqui TTS is one of those projects where the company and the software have almost opposite trajectories. The company Coqui shut down quietly in early 2024. The GitHub repository for their TTS library, which they'd been building since 2020 on top of Mozilla's TTS research, kept getting stars, pull requests, and new model contributions. As of mid-2026, it's still the most widely referenced open-source TTS project in the Python ecosystem, even though the organization that built it is defunct.
This review covers what Coqui TTS actually is in its current community-maintained state, what it can do, and when it makes sense over commercial alternatives.
The history matters here
Understanding Coqui TTS requires understanding where it came from. Mozilla ran a speech research project called Mozilla TTS starting around 2018, which produced a series of open-source TTS models and a training toolkit. When Mozilla scaled back its in-house speech research, several of the engineers spun out to found Coqui in 2020 in Berlin, taking the lineage of Mozilla TTS with them and building a commercial product on top.
Coqui built and released a substantial open-source library alongside their commercial ambitions, including pre-trained models and training code. When the company shut down in early 2024, the commercial product disappeared but the open-source repository remained, and the community that had formed around it kept working.
The result is a library with genuine depth, real production use, and no corporate entity behind it. That's meaningful context for any evaluation because it affects what you can rely on it for.
What the library actually provides
The Coqui TTS Python library is a toolkit, not a single model. You install it, and you get access to a collection of different architectures and pre-trained checkpoints. The most important ones to understand are:
XTTS v2 is the current flagship model for most practical use cases. It supports voice cloning from a reference audio sample and covers 17 languages. The quality is meaningfully better than earlier generations of Coqui models, and the multilingual cloning, where you clone a voice in one language and generate speech in another, is a capability that's genuinely useful for localization work without recorded multilingual talent.
YourTTS is an earlier zero-shot voice cloning model that still sees use in some deployments, particularly where XTTS v2's resource requirements are too high.
Bark integration is available, though Bark is technically a separate Suno project that Coqui provides a wrapper for. Bark has good expressiveness but is slow and unpredictable in a way that makes it poorly suited for anything with latency requirements.
Tacotron2 and FastSpeech2 are earlier architectures that trade quality for speed. They're relevant if you're running on CPU or low-memory environments and need something that finishes in reasonable time.
The library also provides training code for fine-tuning models on custom voice data, which is significant if you want to train a speaker-specific model rather than relying on zero-shot cloning.
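The cross-lingual cloning described for XTTS v2 above can be sketched as a loop over target languages that reuses one reference clip. This is a minimal sketch, assuming a `tts` object that exposes the same `tts_to_file(text=..., speaker_wav=..., language=..., file_path=...)` call shown in the Getting started section; the helper name `synthesize_localized` and the `line_<language>.wav` naming scheme are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def synthesize_localized(tts, speaker_wav, text_by_language, out_dir):
    """Render one reference voice in several languages.

    `tts` is any object with a `tts_to_file` method shaped like the one on
    the Coqui `TTS` API object. Passing it in (rather than constructing it
    here) keeps the helper independent of how the model is loaded.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = {}
    for language, text in text_by_language.items():
        target = out_dir / f"line_{language}.wav"
        # Same reference clip every time; only the output language changes.
        tts.tts_to_file(
            text=text,
            speaker_wav=str(speaker_wav),
            language=language,
            file_path=str(target),
        )
        written[language] = target
    return written
```

With a real model you would pass in the object created by `TTS("tts_models/multilingual/multi-dataset/xtts_v2")` and a mapping such as `{"en": "Welcome back.", "de": "Willkommen zurück."}`. Output quality for each language still needs a listening check; the loop only automates the calls.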
Quality: an honest placement
Coqui TTS output quality, specifically XTTS v2 output quality, is noticeably below current commercial offerings. If you generate the same text with XTTS v2 and with ElevenLabs, and you listen to both, the ElevenLabs output will sound more natural on most voice types and content styles. The gap is real.
This doesn't mean Coqui is unusable for production. The quality is good enough for many applications: internal tooling, accessibility features where any synthesized voice beats silence, e-learning content where a slightly robotic voice is acceptable, and applications where the text being read is very short.
Where it clearly falls short is customer-facing audio where naturalness affects user trust or engagement, long-form content where the listener accumulates subtle quality signals over time, and emotional delivery where prosody matters. For those use cases, ElevenLabs or Play.ht are the practical answer even if the per-character cost is real.
The voice cloning quality from XTTS v2 is similarly positioned: it works, it retains speaker identity from a reference clip, and it's substantially below what Professional Voice Cloning on ElevenLabs produces from the same source material. For prototyping and internal tools, fine. For a production voice clone you're putting in front of customers, the commercial options produce better results.
The self-hosting reality
Running Coqui TTS in production requires infrastructure you own and maintain. That means:
A GPU instance somewhere, either cloud or on-premises. XTTS v2 on CPU is too slow for real-time applications. A g4dn.xlarge on AWS or a comparable GPU instance on other clouds runs XTTS v2 acceptably. At roughly $0.50-2.00 of compute per hour of generated audio depending on instance type and model, the economics at high volume compare favorably to commercial per-character pricing.
An inference service wrapping the library. The Python library gives you a programmatic interface, but you need to build the HTTP service, handle concurrency, manage GPU memory, and deal with model loading times if you want something that scales like an API.
Model and dependency management. The library's dependencies have occasionally had conflicts between versions, and the community doesn't always move quickly on compatibility issues. Budget engineering time for setup and periodic maintenance.
This is not a complaint about Coqui specifically. It's the honest description of running any open-source ML library in production. If your team has the infrastructure engineering capacity to handle this, the cost and control advantages are real. If you don't, a managed API is the right choice regardless of the price difference.
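The inference-service point above (load the model once, then control concurrent access to the GPU) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a production design: `SynthesisService` and `load_model` are hypothetical names, the model object is assumed to expose the `tts_to_file` call used elsewhere in this review, and a single lock is the simplest possible concurrency policy rather than the most efficient one.

```python
import threading


class SynthesisService:
    """Load the model once and serialize access to it.

    `load_model` is a zero-argument callable that returns an object with a
    `tts_to_file` method, for example a function that constructs the Coqui
    `TTS` object. Both names are illustrative.
    """

    def __init__(self, load_model):
        self._load_model = load_model
        self._model = None
        self._lock = threading.Lock()

    def _get_model(self):
        # Lazy load: pay the model loading cost once, not once per request.
        if self._model is None:
            self._model = self._load_model()
        return self._model

    def synthesize(self, text, speaker_wav, language, file_path):
        # One request at a time on the GPU; other callers wait on the lock.
        with self._lock:
            model = self._get_model()
            model.tts_to_file(
                text=text,
                speaker_wav=speaker_wav,
                language=language,
                file_path=file_path,
            )
        return file_path
```

An HTTP layer would call `synthesize` from its request handlers. Serializing requests keeps GPU memory use predictable at the cost of throughput; batching or multiple worker processes are the usual next steps once a single lock becomes the bottleneck.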
The data privacy case
This is where Coqui's self-hosted model has a genuine advantage that no commercial alternative can match. If your application processes audio content that can't leave your infrastructure, due to regulatory requirements, client data agreements, or security policy, the choice is self-hosting or not having the capability. Coqui TTS is the most capable open-source option for that constraint.
Healthcare applications processing patient voice data, legal applications handling client communications, enterprise deployments under data residency requirements, and government applications with restricted data handling policies all potentially fall into this category. For those teams, the quality trade-off against commercial APIs might be acceptable because the alternative isn't ElevenLabs with better privacy controls; the alternative is nothing.
Post-shutdown community state
As of mid-2026, the coqui-ai/TTS repository has accumulated community contributors who are actively merging fixes and improvements. XTTS v2 still receives patches. Issues get addressed, though on a volunteer timeline rather than a commercial support timeline.
The project does not have a commercial entity offering paid support contracts. If you deploy Coqui TTS in production and run into a model bug or a critical security issue in a dependency, you're either fixing it yourself, waiting for the community, or paying a consultant to do it for you. That's the honest maintenance picture.
For research teams, startups prototyping before they have revenue to pay API costs, and organizations with data constraints, this is manageable. For teams that need guaranteed response times on production issues, the lack of commercial backing is a real risk.
When Coqui TTS makes sense
Self-hosted requirements: If your data can't go to a third-party API, Coqui TTS is the highest-quality option that runs entirely on your infrastructure.
High-volume batch processing at low cost: At sufficient scale, running your own GPU cluster for TTS generation is cheaper than per-character commercial pricing. The break-even point depends on your volume and quality requirements, but for teams generating hundreds of hours of audio monthly, the math often favors self-hosting.
Research and experimentation: If you're building something novel in the TTS space, having access to model weights, training code, and a toolkit you can modify is valuable. Commercial APIs don't give you that.
Offline and edge deployment: If you need TTS on devices without internet connectivity, Coqui TTS models can be packaged and deployed locally. Commercial APIs can't.
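The break-even reasoning in the high-volume case above is simple arithmetic and can be sketched as follows. Every input here is an assumption for illustration: the compute figure sits at the top of the $0.50-2.00 per audio hour range quoted earlier, while the API price per audio hour and the fixed monthly cost (idle instances plus engineering time) are placeholder values, not quoted prices. Substitute your own numbers.

```python
def break_even_hours(fixed_monthly_cost, api_cost_per_hour, compute_cost_per_hour):
    """Monthly audio hours at which self-hosting costs the same as a managed API.

    Self-hosted cost: fixed_monthly_cost + compute_cost_per_hour * hours
    API cost:         api_cost_per_hour * hours
    Returns None when self-hosting never catches up.
    """
    saving_per_hour = api_cost_per_hour - compute_cost_per_hour
    if saving_per_hour <= 0:
        return None
    return fixed_monthly_cost / saving_per_hour


# Hypothetical inputs, for illustration only.
hours = break_even_hours(
    fixed_monthly_cost=3000.0,    # assumed: engineering time and idle capacity
    api_cost_per_hour=15.0,       # assumed: commercial price per hour of audio
    compute_cost_per_hour=2.0,    # upper end of the range quoted above
)
print(f"Break-even at roughly {hours:.0f} audio hours per month")
```

Below the break-even volume the managed API is cheaper; above it, each extra hour of audio widens the gap in favor of self-hosting. The fixed cost term is the one teams most often underestimate.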
When it doesn't make sense
If you're building a user-facing product where voice quality affects engagement, Coqui TTS is not where you should start. The quality gap against ElevenLabs is large enough that it affects user experience in measurable ways. Starting with a managed API and evaluating whether the cost is sustainable at your scale is a more sensible path than starting with open source and accepting a quality penalty from day one.
If you don't have GPU infrastructure or the engineering capacity to build and maintain an inference service, the apparent cost advantage evaporates quickly when you account for infrastructure and engineering time.
Getting started
Installation is pip-based:
pip install TTS
For XTTS v2 voice cloning, the basic pattern is:
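from TTS.api import TTS

# Load the XTTS v2 checkpoint (weights are downloaded on first use)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the voice in reference_voice.wav and write the result to output.wav
tts.tts_to_file(
    text="Your text here",
    speaker_wav="reference_voice.wav",
    language="en",
    file_path="output.wav",
)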
The first run downloads the model weights, which are substantial: XTTS v2 is around 1.8 GB. Plan for that in your deployment pipeline.
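Because XTTS v2 is too slow on CPU for real-time use, it helps to choose the device explicitly when loading the model. The sketch below is hedged: `pick_device` and `load_xtts` are hypothetical helper names, and the `.to(device)` call follows the pattern in the project's README, so verify it against the version you have installed.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch available in this environment; fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_xtts(device=None):
    """Load XTTS v2 and move it to the chosen device.

    The import is done lazily so this module can be imported on machines
    where the TTS library is not installed. Constructing the model is what
    triggers the weight download on first use.
    """
    from TTS.api import TTS

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    return tts.to(device or pick_device())
```

Calling `load_xtts()` once during a container build or a warm-up step moves the download out of the request path, which is one way to plan for the weight size in a deployment pipeline.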
The documentation at the GitHub repository covers the available models, API reference, and training pipeline. The community Discord (linked from the repo) is active for questions.
The bottom line
Coqui TTS is the right choice for a specific set of situations: self-hosted requirements, high-volume cost optimization, research, and offline deployment. For everything else, especially customer-facing voice applications where quality matters, the commercial alternatives produce better results with less operational overhead. The company shutdown doesn't change the utility of the library for the use cases it fits, but it's relevant context for any team making a long-term infrastructure decision.
Key features
- 30+ pre-trained TTS models including VITS, YourTTS, Bark, and XTTS for multi-speaker synthesis
- Voice cloning from a short reference audio sample using XTTS v2
- Multilingual support across 17 languages in the XTTS v2 model
- Speaker similarity fine-tuning for custom voice adaptation
- Python API and command-line interface for integration and batch synthesis
- Active community development and model contributions post-company-shutdown
- Streaming synthesis for real-time applications via the Python library
Pros and cons
Pros
- Completely free with no usage limits beyond your own compute costs
- XTTS v2 supports voice cloning across 17 languages from a short reference sample
- Active community continues model development and bug fixes post-company-shutdown
- Full control over data, model weights, and inference pipeline
- Large model zoo covering many architectures for research and production use
- Well-documented Python API with clean integration patterns
Cons
- No hosted service, managed API, or commercial support
- Requires meaningful engineering effort to deploy and maintain in production
- Voice quality on most models trails current commercial offerings like ElevenLabs
- GPU infrastructure is required for reasonable inference speed on quality models
- Company is defunct, so there is no product roadmap and no security patches from a commercial entity
- Community maintenance means issues and feature requests have unpredictable resolution timelines
Who is Coqui TTS for?
- Self-hosted TTS for applications with strict data privacy requirements
- Research and prototyping where API costs at scale are prohibitive
- Offline TTS deployment on edge devices or air-gapped environments
- Custom voice synthesis pipelines where full control over the model is required
Alternatives to Coqui TTS
If Coqui TTS isn't quite the right fit, the closest alternatives are ElevenLabs and Play.ht. See our full Coqui TTS alternatives page for side-by-side comparisons.