Load Testing AI Agents in 2026: Locust, k6, and Custom Approaches

April 2, 2026 · Editorial Team · 7 min read · ai-infrastructure load-testing performance

Load testing a traditional REST API is straightforward. You spin up Locust or k6, write a test that fires requests at your endpoint, and ramp up concurrency until something breaks or you hit your target throughput. The pass/fail condition is usually latency and error rate staying within acceptable bounds.

Load testing an LLM-driven service is a different exercise. Your bottleneck usually isn't your own infrastructure; it's the provider. Latency varies widely from request to request, mostly with how many tokens the model generates. The "correct" behavior of your system can degrade under load in ways that HTTP error rates don't capture. And you're paying per token, so a realistic load test can cost real money.

This guide covers what's different about load testing AI agents and how to set up practical tests with Locust, k6, and custom approaches.

What you're actually testing

Before writing test scripts, be clear about what you're trying to learn.

Provider throughput limits. OpenAI, Anthropic, and other providers have rate limits measured in requests per minute (RPM) and tokens per minute (TPM). These limits vary by model and by your account tier. Understanding where you hit these limits tells you whether you need a higher tier, multiple API keys, or a queuing strategy.
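Before running anything, it can help to estimate which limit a planned load will hit first. The sketch below is a back-of-the-envelope check, not a provider API: the function names and all the numbers are hypothetical, and it treats TPM as one combined budget, whereas some providers meter input and output tokens separately.

```python
def required_rate(requests_per_minute: float,
                  avg_input_tokens: float,
                  avg_output_tokens: float) -> dict:
    """Return the RPM and TPM a given load would consume."""
    tokens_per_request = avg_input_tokens + avg_output_tokens
    return {
        "rpm": requests_per_minute,
        "tpm": requests_per_minute * tokens_per_request,
    }


def limiting_factor(load: dict, rpm_limit: float, tpm_limit: float) -> str:
    """Name the limit that is exceeded by the larger margin, or 'none' if the load fits."""
    rpm_ratio = load["rpm"] / rpm_limit
    tpm_ratio = load["tpm"] / tpm_limit
    if max(rpm_ratio, tpm_ratio) <= 1.0:
        return "none"
    return "rpm" if rpm_ratio >= tpm_ratio else "tpm"


# Hypothetical numbers: 300 requests/min, 1500 input + 500 output tokens each,
# checked against assumed limits of 1000 RPM and 400,000 TPM.
load = required_rate(300, 1500, 500)
print(load, limiting_factor(load, rpm_limit=1000, tpm_limit=400_000))
```

With long prompts, the token limit is often the one that binds well before the request limit does, which is why both are worth checking.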

Your own system's behavior under concurrency. If you have middleware, caching, routing logic, or a queue between your application and the LLM provider, that infrastructure has its own performance profile. Load testing catches bottlenecks here.

Latency distribution at scale. The p99 latency at 10 concurrent users is often much lower than at 100 concurrent users. Understanding this curve tells you what your users will experience at different load levels.
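To build that curve from raw results, you need percentiles per concurrency level. A minimal sketch using the nearest-rank method follows; the helper names and the sample latencies are made up for illustration, and a load testing tool's built-in percentile reporting would serve the same purpose.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_curve(runs: dict[int, list[float]]) -> dict[int, dict[str, float]]:
    """Map each concurrency level to its p50 and p99 latency."""
    return {
        concurrency: {
            "p50": percentile(latencies, 50),
            "p99": percentile(latencies, 99),
        }
        for concurrency, latencies in sorted(runs.items())
    }


# Made-up latencies in seconds, keyed by number of concurrent users.
runs = {
    10: [1.1, 1.2, 1.3, 1.2, 1.4, 1.1, 1.3, 1.2, 1.5, 2.0],
    100: [1.8, 2.4, 3.1, 2.2, 9.5, 2.6, 2.9, 3.3, 2.1, 12.0],
}
for concurrency, stats in latency_curve(runs).items():
    print(concurrency, stats)
```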

Cost per request at scale. If you have prompt caching enabled, cache hit rates often improve under sustained load (more requests that share cache-eligible prefixes). Load testing with cost tracking tells you your real per-request cost at production volume, not just the theoretical cost from token math.

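What you're generally not testing: whether the model itself gives worse answers under load. Model quality is largely stable across load levels (the provider handles that). What can degrade is your own system's behavior around the model, such as timeouts that cut responses short or fallback paths that kick in under pressure. You're testing infrastructure, not intelligence.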

Why standard tools need adaptation

Standard load testing tools send HTTP requests and measure HTTP responses. For LLM-driven agents, several things complicate this.

Streaming responses. If your API uses streaming (Server-Sent Events or chunked transfer), there are two latencies to measure: time to first token (TTFT) and total time until the last token arrives. Depending on the tool and how it is configured, the reported response time may cover only one of them: a client that returns as soon as the response starts reports something close to TTFT, while one that buffers the whole body reports only the total. TTFT is an important metric, but for most applications you also care about total response time. You need to consume the full stream and time it end-to-end.
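The two timings can be captured in one pass over the stream. The sketch below works on any iterable of byte chunks; `time_stream` and `fake_stream` are hypothetical names, and the fake stream stands in for a real HTTP response so the example runs on its own. In a real test, the timer should start just before the request is sent.

```python
import time
from typing import Iterable, Iterator


def time_stream(chunks: Iterable[bytes]) -> dict:
    """Consume a stream and record time to first chunk and total time."""
    start = time.perf_counter()
    ttft = None
    total_bytes = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip keep-alive or empty chunks
        if ttft is None:
            ttft = time.perf_counter() - start
        total_bytes += len(chunk)
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "bytes": total_bytes}


def fake_stream() -> Iterator[bytes]:
    """Stand-in for a streaming HTTP body: three chunks with small delays."""
    for piece in (b"Hello", b", ", b"world"):
        time.sleep(0.01)
        yield piece


result = time_stream(fake_stream())
print(result)
```

With the `requests` library, the same function could be fed `response.iter_content(chunk_size=None)` from a request made with `stream=True`.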

Variable response length. A query that produces a 50-token response takes much less time than one that produces 1000 tokens. This makes "average latency" almost meaningless without controlling for response length. Your load test requests should have roughly known expected output lengths, or you should bucket your latency measurements by response length.
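Bucketing is simple to do after the fact if each result records its output token count. A minimal sketch, assuming results are available as `(output_tokens, latency_seconds)` pairs; the bucket edges and function names are arbitrary choices for illustration.

```python
from collections import defaultdict
from statistics import median


def bucket_label(output_tokens: int, edges=(100, 500, 1000)) -> str:
    """Return a label such as '100-499' for the bucket a token count falls into."""
    lower = 0
    for edge in edges:
        if output_tokens < edge:
            return f"{lower}-{edge - 1}"
        lower = edge
    return f"{lower}+"


def median_latency_by_bucket(samples) -> dict:
    """Group (output_tokens, latency) pairs by bucket and take the median latency of each."""
    buckets = defaultdict(list)
    for tokens, latency in samples:
        buckets[bucket_label(tokens)].append(latency)
    return {label: median(values) for label, values in buckets.items()}


# Made-up results: (output tokens, latency in seconds).
samples = [(50, 1.0), (60, 3.0), (450, 6.5), (1200, 20.0)]
print(median_latency_by_bucket(samples))
```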

Non-deterministic outputs. You can't validate response bodies the same way you would for a normal API. You can't assert that the body equals a known string. You need content validators that check for structural properties: valid JSON, presence of required fields, response length in expected range.
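A structural validator along these lines can be shared across test tools. This is a sketch under assumptions: it expects a JSON object with a string `content` field, and the length bounds are placeholders you would tune to your own responses.

```python
import json


def validate_response(body: str,
                      required_fields=("content",),
                      min_len: int = 10,
                      max_len: int = 8000) -> list[str]:
    """Return a list of structural problems; an empty list means the response passed."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return ["body is not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    for field in required_fields:
        if field not in data:
            problems.append(f"missing field: {field}")

    content = data.get("content")
    if isinstance(content, str) and not (min_len <= len(content) <= max_len):
        problems.append(
            f"content length {len(content)} outside [{min_len}, {max_len}]"
        )
    return problems


print(validate_response('{"content": "A perfectly reasonable answer."}'))
print(validate_response('{"content": "hi"}'))
```

Returning a list of problems rather than a boolean makes it easier to see in the test report why responses failed, which matters when failures only appear under load.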

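The real cost. A realistic load test against a capable model like claude-3-5-sonnet adds up quickly: 1000 requests at roughly 1000 output tokens each is a million output tokens, before counting input tokens, retries, or the extra calls a multi-step agent makes. Depending on current pricing and prompt size, a single run can cost anywhere from a few dollars to tens of dollars. This isn't a reason not to test, but it means you should design your tests carefully and not fire them repeatedly during development.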

Using Locust for LLM load testing

Locust is Python-based, which makes it easy to work with LLM SDKs. The basic structure for a non-streaming test looks like this:

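import random
import uuid

from locust import HttpUser, task, between

# Placeholder corpus; in practice, load queries sampled from production logs.
QUERIES = [
    "Summarize our refund policy in two sentences.",
    "What is the status of order 1234?",
]

class LLMUser(HttpUser):
    wait_time = between(1, 3)

    def on_start(self):
        # HttpUser has no built-in user_id, so generate one per simulated user.
        self.user_id = uuid.uuid4().hex[:8]
        self.total_tokens = 0

    def pick_realistic_query(self):
        return random.choice(QUERIES)

    @task
    def ask_question(self):
        # self.client already times the request and reports it to Locust's
        # stats, so firing the request event by hand would double-count it.
        with self.client.post("/api/ask", json={
            "message": self.pick_realistic_query(),
            "user_id": f"loadtest-{self.user_id}"
        }, name="/api/ask", catch_response=True) as response:
            if response.status_code != 200:
                response.failure(f"unexpected status {response.status_code}")
                return
            try:
                data = response.json()
            except ValueError:
                response.failure("response body is not JSON")
                return
            # Assumes the endpoint returns a usage.total_tokens field.
            self.total_tokens += data.get("usage", {}).get("total_tokens", 0)
            response.success()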

The key additions over a standard Locust test are:

First, use a realistic query corpus, not random strings. Sample real queries from your production logs or create a diverse set manually. The queries should have similar token lengths to your production traffic and should elicit similarly sized responses, because latency depends on both prompt length and output length.

Second, track custom metrics: token usage per request, cost (you can calculate this from token counts and known pricing), and cache hit rate if you're using prompt caching.

Third, handle rate limit errors (HTTP 429) explicitly. Locust's default behavior on 429 is to count them as failures and continue. You often want to back off and retry instead, which requires a custom task implementation.
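The back-off logic itself is independent of Locust. Here is a minimal sketch of exponential back-off with full jitter that prefers the server's `Retry-After` header when it is a plain number of seconds; `backoff_delay` and its defaults are hypothetical, and the `rng` parameter exists only to make the function testable.

```python
import random
from typing import Callable, Optional


def backoff_delay(attempt: int,
                  retry_after: Optional[str] = None,
                  base: float = 1.0,
                  cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    if retry_after is not None:
        try:
            # Honor a numeric Retry-After header, bounded by the cap.
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # Retry-After may also be an HTTP date; fall back to back-off
    # Exponential back-off with full jitter: uniform in [0, min(cap, base * 2^attempt)).
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


for attempt in range(4):
    print(attempt, round(backoff_delay(attempt), 2))
```

Inside a task, this would be called with `response.headers.get("Retry-After")` when the status is 429, followed by a sleep and a bounded number of retries. Whether retried requests count toward your latency numbers is a design decision worth making explicitly.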

For streaming endpoints, you need to consume the stream in your Locust task and measure end-to-end time:

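import time  # at the top of the locustfile

# A method on the LLMUser class:
@task
def stream_question(self):
    start = time.perf_counter()
    with self.client.post("/api/stream",
                           json={"message": self.pick_realistic_query()},
                           stream=True) as response:
        # Join raw bytes before decoding so a multi-byte character split
        # across two chunks does not raise UnicodeDecodeError.
        body = b"".join(response.iter_content(chunk_size=None))
    total_ms = (time.perf_counter() - start) * 1000
    # Report the full-stream time as its own stats entry, separate from
    # the entry Locust records for the request itself.
    self.environment.events.request.fire(
        request_type="STREAM", name="/api/stream (full)",
        response_time=total_ms, response_length=len(body),
        exception=None, context={},
    )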

Using k6 for LLM load testing

k6 is JavaScript-based and has good support for custom metrics, which makes it well-suited for tracking per-test token usage and cost.

The typical k6 setup for an LLM endpoint:

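import http from 'k6/http';
import { check, sleep } from 'k6';
import { Trend, Counter, Rate } from 'k6/metrics';

// k6 needs absolute URLs. BASE_URL is an assumed environment variable, e.g.
//   k6 run -e BASE_URL=https://staging.example.com script.js
const BASE_URL = __ENV.BASE_URL || 'http://localhost:8000';

// Placeholder corpus; replace with queries sampled from production logs.
const QUERIES = [
  'Summarize our refund policy in two sentences.',
  'What is the status of order 1234?',
];

const latencyTrend = new Trend('llm_latency', true);  // true = values are times in ms
const tokenCounter = new Counter('total_tokens');
const successRate = new Rate('success_rate');

export const options = {
  stages: [
    { duration: '2m', target: 10 },
    { duration: '5m', target: 50 },
    { duration: '2m', target: 0 },
  ],
  thresholds: {
    'llm_latency': ['p(99)<15000'],  // 15 second p99
    'http_req_failed': ['rate<0.01'],
    'success_rate': ['rate>0.95'],
  },
};

function getRealisticQuery() {
  return QUERIES[Math.floor(Math.random() * QUERIES.length)];
}

// Error responses (429s, gateway errors) are often not JSON, so parse defensively.
function parseBody(res) {
  try {
    return JSON.parse(res.body);
  } catch (e) {
    return null;
  }
}

export default function () {
  const start = Date.now();
  const res = http.post(`${BASE_URL}/api/ask`, JSON.stringify({
    message: getRealisticQuery(),
  }), {
    headers: { 'Content-Type': 'application/json' },
  });

  // http.post returns after the full body is read, so this is end-to-end time.
  const elapsed = Date.now() - start;
  latencyTrend.add(elapsed);

  const body = parseBody(res);
  const success = check(res, {
    'status is 200': (r) => r.status === 200,
    'response has content': () =>
      body !== null && typeof body.content === 'string' && body.content.length > 10,
  });

  successRate.add(success);

  // Assumes the endpoint returns a usage.total_tokens field.
  if (res.status === 200 && body && body.usage) {
    tokenCounter.add(body.usage.total_tokens || 0);
  }

  sleep(Math.random() * 2 + 1);
}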

The thresholds block is where k6 shines. You define pass/fail criteria, and k6 exits non-zero if any threshold is violated. This integrates cleanly into CI pipelines: run a moderate load test on every deployment, fail the pipeline if p99 latency exceeds your SLA.

Custom approaches for complex agents

For multi-step agents, standard HTTP load testing only tells part of the story. A single user interaction might involve multiple API calls, tool executions, and state management. You need to test the full interaction, not just the HTTP endpoint.

One pattern is direct SDK testing rather than HTTP testing:

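import asyncio
import time

import anthropic

# AsyncAnthropic returns awaitables. The synchronous client would block the
# event loop inside an async function and serialize every request.
client = anthropic.AsyncAnthropic()

def load_query_corpus() -> list[str]:
    # Placeholder; replace with queries sampled from production logs.
    return ["Summarize our refund policy in two sentences."]

async def run_agent_interaction(query: str, semaphore: asyncio.Semaphore) -> dict:
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        start = time.perf_counter()
        response = await client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{"role": "user", "content": query}]
        )
        return {
            "latency": time.perf_counter() - start,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        }

async def load_test(concurrency: int, num_requests: int) -> list:
    queries = load_query_corpus()
    semaphore = asyncio.Semaphore(concurrency)
    tasks = [
        run_agent_interaction(queries[i % len(queries)], semaphore)
        for i in range(num_requests)
    ]
    # return_exceptions=True keeps one failed call (e.g. a 429) from
    # aborting the whole run; failures come back as exception objects.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results

if __name__ == "__main__":
    results = asyncio.run(load_test(concurrency=10, num_requests=100))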

This approach is messier to run than a dedicated load testing tool, but it lets you test the actual agent logic end-to-end, not just the HTTP interface.
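Because you are not using a load testing tool here, you also have to do the reporting yourself. The sketch below summarizes a result list of the shape produced by `asyncio.gather(..., return_exceptions=True)`: dictionaries for successful calls and exception objects for failed ones. The `summarize` helper and the sample data are illustrative only.

```python
def summarize(results: list) -> dict:
    """Aggregate successful results and count failures from a gather() result list."""
    ok = [r for r in results if isinstance(r, dict)]
    errors = [r for r in results if isinstance(r, BaseException)]
    latencies = [r["latency"] for r in ok]
    return {
        "requests": len(results),
        "errors": len(errors),
        "error_rate": len(errors) / len(results) if results else 0.0,
        "mean_latency": sum(latencies) / len(latencies) if latencies else None,
        "max_latency": max(latencies) if latencies else None,
        "input_tokens": sum(r["input_tokens"] for r in ok),
        "output_tokens": sum(r["output_tokens"] for r in ok),
    }


# Made-up results: two successes and one failure.
sample_results = [
    {"latency": 1.0, "input_tokens": 10, "output_tokens": 20},
    {"latency": 3.0, "input_tokens": 30, "output_tokens": 40},
    RuntimeError("rate limited"),
]
print(summarize(sample_results))
```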

What good test design looks like

A few principles that save time:

Use real queries from production. Synthetic queries that don't match your actual traffic distribution will give you misleading latency and token usage numbers. Export a sample from your production logs (anonymized if necessary) and use that.

Test at realistic inter-arrival times. If your production traffic averages 10 requests per minute, don't run a test with 10 concurrent users firing as fast as possible. That's not what production looks like. Model realistic think time and request spacing.
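One common way to model request spacing is a Poisson process, where the gaps between requests are exponentially distributed around the average rate. That is a modeling assumption, not a property of your traffic; real traffic is often burstier, so compare against your logs. The function name below is hypothetical.

```python
import random
from typing import Optional


def poisson_arrivals(rate_per_minute: float,
                     duration_s: float,
                     seed: Optional[int] = None) -> list[float]:
    """Generate request send times (seconds from start) for a Poisson arrival process."""
    if rate_per_minute <= 0:
        raise ValueError("rate_per_minute must be positive")
    rng = random.Random(seed)
    rate_per_second = rate_per_minute / 60.0
    t = 0.0
    arrivals = []
    while True:
        # Exponentially distributed gap with mean 1 / rate_per_second.
        t += rng.expovariate(rate_per_second)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


# 10 requests per minute on average, over a 10-minute test window.
schedule = poisson_arrivals(10, 600, seed=42)
print(len(schedule), [round(t, 1) for t in schedule[:5]])
```

A test driver can then sleep until each scheduled time before sending, instead of looping as fast as the endpoint allows.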

Run ramp-up tests, not just sustained load. A sudden spike from 0 to 100 users tests different things than a gradual ramp. Both happen in production (marketing emails cause spikes; organic growth is a ramp). Test both.
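The two profiles can be written as plain functions of elapsed time that return a target user count. This is a sketch with made-up parameters; in Locust, functions like these could drive a custom `LoadTestShape`, and in k6 the same shapes would be expressed as `stages`.

```python
def ramp_users(t: float, peak: int, ramp_s: float) -> int:
    """Linear ramp from 0 to `peak` users over `ramp_s` seconds, then hold."""
    if t <= 0:
        return 0
    if t >= ramp_s:
        return peak
    return int(peak * t / ramp_s)


def spike_users(t: float, baseline: int, peak: int,
                spike_start: float, spike_len: float) -> int:
    """Hold `baseline` users, jump to `peak` during the spike window, then drop back."""
    in_spike = spike_start <= t < spike_start + spike_len
    return peak if in_spike else baseline


# Compare the two profiles at a few points in time (seconds).
for t in (0, 30, 60, 75, 150, 300):
    print(t, ramp_users(t, peak=100, ramp_s=300),
          spike_users(t, baseline=5, peak=100, spike_start=60, spike_len=30))
```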

Track cost as a test output. Sum the tokens across your load test and calculate what that would cost in production. If your test ran 1000 requests at your expected production volume for 10 minutes and cost $8.50, you can extrapolate monthly costs. This is more accurate than theoretical token math.
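The arithmetic is short enough to keep next to your test results. In the sketch below, the per-million-token prices are placeholders rather than any provider's current pricing, the function names are hypothetical, and the extrapolation assumes the tested load is representative of the hours you count as active.

```python
def run_cost(input_tokens: int, output_tokens: int,
             input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost in dollars of a test run, given token totals and prices per million tokens."""
    return (input_tokens / 1_000_000 * input_price_per_mtok
            + output_tokens / 1_000_000 * output_price_per_mtok)


def monthly_cost(test_cost_usd: float, test_minutes: float,
                 active_minutes_per_month: float) -> float:
    """Scale the cost of a test window up to a month of comparable traffic."""
    return test_cost_usd * active_minutes_per_month / test_minutes


# Placeholder prices: $3 per million input tokens, $15 per million output tokens.
print(run_cost(1_000_000, 1_000_000, 3.0, 15.0))

# The example from the text: a 10-minute test that cost $8.50, extrapolated to
# 8 hours of comparable load per day for 30 days (14,400 minutes).
print(monthly_cost(8.50, test_minutes=10, active_minutes_per_month=8 * 60 * 30))
```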

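Don't load test against production provider endpoints without understanding your rate limits. If you're on a tier with 1000 RPM and you fire 2000 RPM in a test, you'll generate a wave of 429s that might affect your production traffic during the test. Run load tests during low-traffic windows or with credentials dedicated to testing. Check how your provider scopes its limits first: if they apply per account or organization rather than per key, a separate API key alone still draws from the same quota as production.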

The results of a load test feed directly into your capacity planning and alerting thresholds. Once you know your p99 latency curve under load, you can set meaningful SLAs for your monitoring dashboards.