Agentbrisk

AI Tools Compared by Speed 2026: Tokens Per Second Across Providers

March 28, 2026 · Editorial Team · 6 min read · ai-tools, performance, comparison

Speed in AI systems is measured in tokens per second: how fast a model generates output tokens. For short responses, the difference between fast and slow models is a few seconds. For long-form outputs, code generation, or real-time applications, it's the difference between a tool that feels fluid and one that feels like it's thinking too hard.

Here's where the major providers actually land, based on public benchmarks and direct testing as of early 2026.


The tiers: a rough map before the numbers

AI inference speed falls into three practical tiers:

Extreme speed (800-3,000+ tokens/sec): Groq and Cerebras. These use specialized hardware designed entirely around fast inference rather than flexible training. The tradeoff is that you're using specific model versions at specific sizes.

Standard fast (80-400 tokens/sec): Most frontier models from Anthropic, OpenAI, and Google in normal serving conditions, with the small speed-optimized variants at the upper end of the range. This is "fast enough for almost everything" territory.

Slow/reasoning-heavy (10-50 tokens/sec): Extended reasoning modes, large model variants, and tasks where the model is spending time "thinking" before output.


Groq: the speed leader

Groq runs models on their Language Processing Units (LPUs), custom silicon designed specifically for transformer inference. The result is a speed profile that looks nothing like standard GPU-based inference.

Measured throughput on Groq's API:

  • Llama 3.3 70B: approximately 800-1,100 tokens/sec
  • Llama 3.1 8B: approximately 1,500-2,500 tokens/sec
  • Mixtral 8x7B: approximately 500-700 tokens/sec

These numbers come from Groq's own benchmarks and third-party testing. Real-world throughput varies with prompt length, concurrent requests, and output length, but the speed advantage over GPU-based providers is large enough to hold up under that variance.

What you give up with Groq: model choice. You're using open-weight models (Llama variants, Mixtral) rather than frontier proprietary models. Llama 3.3 70B is excellent and competitive with mid-tier GPT-4 performance for many tasks. But it's not Claude 3.7 Sonnet or GPT-4o. For applications where raw speed matters more than peak capability, such as real-time conversational interfaces, streaming completions, or high-volume generation tasks, Groq is hard to beat.

Pricing: Groq's API is cheap. Llama 3.3 70B costs around $0.59/million input tokens and $0.79/million output tokens. For comparison, that's roughly 5x cheaper than frontier model options while being dramatically faster.
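
To make the per-million-token prices concrete, here is a minimal sketch of the arithmetic for a single request. The function name and the example token counts are illustrative assumptions; only the two prices come from the figures above, and actual pricing may differ.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request given prices quoted per million tokens."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Prices quoted above for Llama 3.3 70B on Groq (USD per million tokens).
LLAMA_70B_INPUT_PRICE = 0.59
LLAMA_70B_OUTPUT_PRICE = 0.79

# Hypothetical request: 2,000 prompt tokens, 500 generated tokens.
cost = request_cost_usd(2_000, 500, LLAMA_70B_INPUT_PRICE, LLAMA_70B_OUTPUT_PRICE)
print(f"${cost:.6f} per request")
```

At this scale a single request costs a fraction of a cent, so the price difference only becomes visible at high volume.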


Cerebras: Groq's direct competitor

Cerebras uses a different architectural approach (their Wafer Scale Engine, a chip the size of an entire silicon wafer) but achieves a similar result: extremely fast transformer inference.

Measured throughput on Cerebras' API:

  • Llama 3.3 70B: approximately 1,000-2,000 tokens/sec
  • Llama 3.1 8B: approximately 2,000-3,600 tokens/sec

Cerebras slightly edges out Groq under benchmark conditions for some model sizes, and the two trade the lead depending on the specific model and load. In practice, both are dramatically faster than standard GPU-based inference.

Like Groq, Cerebras serves open-weight models. The model selection is similar (Llama-family models primarily), and the pricing is comparable.

Groq vs Cerebras for speed-critical applications: Both work. Groq has a larger model catalog and slightly better documentation for production use. Cerebras is highly competitive on raw throughput. If you're evaluating both, benchmark your specific use case rather than relying on general numbers, because workload characteristics determine which one wins for your application.


Anthropic: Claude's speed profile

Claude's models are served through Anthropic's infrastructure on standard GPU clusters. Speed varies by model:

  • Claude 3.5 Haiku: approximately 180-250 tokens/sec on the API
  • Claude 3.7 Sonnet: approximately 80-140 tokens/sec
  • Claude 3.7 Sonnet with extended thinking: approximately 15-40 tokens/sec during the thinking phase

Haiku is Anthropic's speed-optimized model. It's not as capable as Sonnet for complex tasks, but for classification, simple Q&A, extraction, and other tasks where Sonnet is overkill, Haiku gives you faster throughput at lower cost.

Extended thinking mode dramatically reduces effective output speed because the model generates internal reasoning tokens (which you often don't see) before producing the final answer. A task that takes 5 seconds in normal mode might take 30 seconds in extended thinking mode. What you get in return is reasoning quality: extended thinking performs meaningfully better on hard math, logic, and complex multi-step tasks.
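
A rough way to see why hidden reasoning tokens drag down the speed you perceive: the sketch below assumes reasoning tokens and answer tokens are generated at the same raw rate, which is a simplification, and the token counts are made up to match the 5-second and 30-second example above.

```python
def visible_speed(answer_tokens: int, thinking_tokens: int,
                  raw_tokens_per_sec: float) -> tuple[float, float]:
    """Return (total_seconds, visible_tokens_per_sec) for one response.

    Assumes hidden reasoning tokens cost the same time as visible ones.
    """
    total_seconds = (answer_tokens + thinking_tokens) / raw_tokens_per_sec
    return total_seconds, answer_tokens / total_seconds


# Hypothetical 500-token answer from a model generating 100 tokens/sec.
normal_time, normal_rate = visible_speed(500, 0, 100)
thinking_time, thinking_rate = visible_speed(500, 2_500, 100)

print(f"normal:   {normal_time:.0f} s, {normal_rate:.0f} visible tokens/sec")
print(f"thinking: {thinking_time:.0f} s, {thinking_rate:.1f} visible tokens/sec")
```

Under these assumptions the raw generation rate never changes. The visible rate drops only because most of the generated tokens are reasoning you don't see.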

For latency-sensitive applications using Claude, Haiku is the right choice. For quality-sensitive tasks where speed is secondary, Sonnet with extended thinking is worth the wait.


OpenAI: GPT-4o speed characteristics

GPT-4o is OpenAI's current primary offering, balancing capability and speed:

  • GPT-4o: approximately 100-180 tokens/sec
  • GPT-4o mini: approximately 200-300 tokens/sec
  • o3 / o3-mini reasoning models: approximately 20-60 tokens/sec for output (variable based on "thinking" budget)

GPT-4o mini is OpenAI's speed-optimized option, analogous to Claude Haiku. For tasks that don't require the full GPT-4o capability, mini gets you approximately 2x the token output speed at a fraction of the cost.

The o-series reasoning models are deliberately slow. The "thinking" budget controls how long the model reasons before answering, and longer thinking produces better answers on hard problems at the cost of latency. For interactive use, o3-mini with a low thinking budget is manageable. For research or analysis tasks where you're waiting anyway, the extra latency is worth it.


Google: Gemini's speed tier

Gemini models vary significantly:

  • Gemini 2.0 Flash: approximately 200-350 tokens/sec
  • Gemini 2.5 Pro: approximately 80-150 tokens/sec
  • Gemini 1.5 Flash: approximately 250-400 tokens/sec

Gemini Flash variants are genuinely fast for frontier models. Gemini 2.0 Flash is competitive with GPT-4o mini and Claude Haiku on speed while being more capable than either on many benchmarks. For applications that need a capable model fast, it is one of the better options in the standard-GPU tier.

Gemini 2.5 Pro is slower because it's larger and more capable. Its extended thinking mode (similar to Claude's) adds additional latency when enabled.


First-token latency vs throughput

The numbers above are all throughput (how fast tokens arrive once output starts). First-token latency is a different measurement: how long until you see the first word.
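
One way to separate the two measurements, as a minimal sketch: record when the request was sent and when each streamed token arrived, then compute both numbers from the timestamps. The function and field names are illustrative, not any provider's API, and the timestamps below are synthetic.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamStats:
    first_token_latency_s: float  # seconds from request to first token
    tokens_per_sec: float         # throughput once output has started


def stream_stats(request_sent_at: float, arrivals: Sequence[float]) -> StreamStats:
    """Compute both metrics from token arrival timestamps (in seconds)."""
    if len(arrivals) < 2:
        raise ValueError("need at least two tokens to measure throughput")
    first, last = arrivals[0], arrivals[-1]
    if last <= first:
        raise ValueError("timestamps must be increasing")
    latency = first - request_sent_at
    # Throughput counts the tokens that arrived after the first one,
    # divided by the time elapsed since the first one.
    rate = (len(arrivals) - 1) / (last - first)
    return StreamStats(latency, rate)


# Synthetic stream: first token at 0.5 s, then one token every 10 ms.
arrivals = [0.5 + i * 0.01 for i in range(101)]
stats = stream_stats(0.0, arrivals)
print(f"first token: {stats.first_token_latency_s:.2f} s, "
      f"throughput: {stats.tokens_per_sec:.0f} tokens/sec")
```

Measuring throughput from the first token onward keeps the startup wait out of the tokens-per-second figure, so the two numbers can be compared independently.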

For short, real-time interactions, first-token latency matters more than throughput. A model that starts responding in 0.3 seconds and then produces at 100 tokens/sec often feels faster than one that waits 2 seconds to start and then runs at 200 tokens/sec.
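
The example in the previous paragraph can be worked through directly. This sketch models total response time as first-token latency plus output length divided by throughput, and solves for the output length at which the two hypothetical models finish at the same moment. It covers completion time only, not how responsive a stream feels while it is arriving.

```python
def response_time(first_token_latency_s: float, tokens_per_sec: float,
                  output_tokens: int) -> float:
    """Seconds until the full response has arrived."""
    return first_token_latency_s + output_tokens / tokens_per_sec


def break_even_tokens(latency_a: float, rate_a: float,
                      latency_b: float, rate_b: float) -> float:
    """Output length at which models A and B take equal total time.

    Solves latency_a + n / rate_a == latency_b + n / rate_b for n.
    """
    if rate_a == rate_b:
        raise ValueError("equal throughput: the lower-latency model always wins")
    return (latency_b - latency_a) / (1 / rate_a - 1 / rate_b)


# Model A: 0.3 s to first token, 100 tokens/sec.
# Model B: 2.0 s to first token, 200 tokens/sec.
crossover = break_even_tokens(0.3, 100, 2.0, 200)
print(f"break-even at about {crossover:.0f} output tokens")

for n in (50, 1_000):
    a = response_time(0.3, 100, n)
    b = response_time(2.0, 200, n)
    print(f"{n} tokens: A finishes in {a:.2f} s, B in {b:.2f} s")
```

Below the crossover length the quick-starting model finishes first. Above it, the higher-throughput model makes up for its slow start.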

First-token latency by provider (approximate, varies with load):

  • Groq: 150-300ms
  • Cerebras: 150-400ms
  • Anthropic (Haiku): 400-800ms
  • OpenAI (GPT-4o mini): 400-700ms
  • Google (Gemini Flash): 300-600ms
  • Anthropic (Sonnet): 600-1,200ms
  • OpenAI (GPT-4o): 700-1,500ms

For chatbots, voice interfaces, or any application where responsiveness is part of the experience, first-token latency should be part of your evaluation.
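
As a small illustration of using these figures in an evaluation, the sketch below filters the list above by a first-token latency budget, judging each option by the upper end of its range. The ranges are the approximate ones quoted above, and the helper name and budgets are assumptions for the example.

```python
# Approximate first-token latency ranges from the list above, in milliseconds.
FIRST_TOKEN_LATENCY_MS = {
    "Groq": (150, 300),
    "Cerebras": (150, 400),
    "Anthropic (Haiku)": (400, 800),
    "OpenAI (GPT-4o mini)": (400, 700),
    "Google (Gemini Flash)": (300, 600),
    "Anthropic (Sonnet)": (600, 1200),
    "OpenAI (GPT-4o)": (700, 1500),
}


def within_budget(budget_ms: int) -> list[str]:
    """Options whose slowest quoted first-token latency fits the budget."""
    return sorted(
        name
        for name, (_best, worst) in FIRST_TOKEN_LATENCY_MS.items()
        if worst <= budget_ms
    )


for budget in (500, 800):
    print(f"{budget} ms budget: {within_budget(budget)}")
```

Filtering on the upper bound is deliberately conservative. Measured latencies under your own load should replace these ranges before any real decision.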


When speed actually matters

Not every use case is speed-limited. Here's an honest look at when to prioritize speed:

Real-time conversation: If you're building something conversational, whether it's a chatbot, voice agent, or live coding assistant, latency shapes the entire experience. Users notice waits above 1-2 seconds in interactive contexts. For these applications, the right choice is Groq or Cerebras for less complex tasks, or Gemini Flash or Claude Haiku when the task needs a more capable model.

High-volume generation: If you need to process thousands of documents, generate content at scale, or run inference in batch, throughput is what limits you. Groq and Cerebras can deliver on the order of 10x more tokens per second per dollar than standard GPU-based providers. For batch jobs, this translates directly to faster completion times and lower costs.
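
A back-of-the-envelope sketch of what throughput means for a batch job. It counts output tokens only and ignores prompt processing, first-token latency, and rate limits, so treat the results as a lower bound. The document count, tokens per document, and throughput figures are hypothetical.

```python
def batch_hours(num_documents: int, output_tokens_per_doc: int,
                tokens_per_sec: float, parallel_streams: int = 1) -> float:
    """Hours needed to generate all output, counting generation time only."""
    total_tokens = num_documents * output_tokens_per_doc
    seconds = total_tokens / (tokens_per_sec * parallel_streams)
    return seconds / 3600


# Hypothetical job: 10,000 documents, 500 output tokens each, one stream.
standard = batch_hours(10_000, 500, tokens_per_sec=100)
fast = batch_hours(10_000, 500, tokens_per_sec=1_000)

print(f"at 100 tokens/sec:   {standard:.1f} hours")
print(f"at 1,000 tokens/sec: {fast:.1f} hours")
```

Running several streams in parallel narrows the gap on wall-clock time, which is why cost per token matters as much as raw speed for batch work.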

Streaming interfaces: Many AI interfaces stream tokens as they're generated, so the user sees the output building word by word. In streaming interfaces, throughput determines how quickly the full response appears. Users find high-throughput streaming satisfying even for long responses because they see progress continuously.

When speed doesn't matter:

  • Asynchronous tasks where you're waiting anyway (overnight batch jobs, background analysis)
  • Extended reasoning tasks where thinking time is the point
  • Low-volume, high-quality work like client deliverables or complex research

The practical decision

For developer teams choosing inference providers, the decision often comes down to:

Use Groq or Cerebras when you need maximum speed and open-weight models are adequate for your task. Ideal for real-time applications, high-volume batch processing, and cost-sensitive inference.

Use Anthropic, OpenAI, or Google when you need frontier model capability. Use their fast variants (Haiku, GPT-4o mini, Gemini Flash) when speed matters alongside capability. Use their full models when quality is the priority.

Hybrid architectures are common in production: a fast small model for initial responses or routing, with a call to a larger model for complex queries that need it. This pattern gets you the speed of fast inference for most requests while maintaining quality where it counts.
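
The routing pattern can be sketched in a few lines. Everything here is a placeholder: the two model functions are stubs standing in for real API calls, and the complexity check is a naive keyword-and-length heuristic chosen for illustration, not a recommended classifier.

```python
from typing import Callable

Model = Callable[[str], str]


def make_router(fast_model: Model, strong_model: Model,
                is_complex: Callable[[str], bool]) -> Model:
    """Build a function that sends each prompt to one of two models."""
    def route(prompt: str) -> str:
        if is_complex(prompt):
            return strong_model(prompt)
        return fast_model(prompt)
    return route


def looks_complex(prompt: str) -> bool:
    """Naive heuristic: long prompts or reasoning-style keywords."""
    keywords = ("prove", "step by step", "refactor", "analyze")
    lowered = prompt.lower()
    return len(prompt.split()) > 200 or any(k in lowered for k in keywords)


# Stubs standing in for calls to a fast provider and a frontier model.
def fast_stub(prompt: str) -> str:
    return f"[fast] {prompt}"


def strong_stub(prompt: str) -> str:
    return f"[strong] {prompt}"


route = make_router(fast_stub, strong_stub, looks_complex)
print(route("What is the capital of France?"))
print(route("Prove that the square root of 2 is irrational."))
```

In production the heuristic is often replaced by the fast model itself acting as a classifier, which adds one quick call before the routing decision.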

For a look at which models handle very long documents, the AI tools comparison by context length covers the token window side of the same question.
