AI Agent Evaluation Suites in 2026: Real Metrics That Matter

April 8, 2026 · Editorial Team · 9 min read · ai-agents evaluation testing

Shipping an AI agent without evaluation is like shipping software without tests. You'll catch the obvious failures, but the subtle ones will bite you in production. The user who gets a confidently wrong answer and never comes back. The agent that works great for 95% of queries and silently fails on the 5% that matter most to your best customers.

The good news is that the tooling for agent evaluation has gotten genuinely good over the last year. The bad news is that most teams still aren't using it consistently.

What "evaluation" actually means for agents

Evaluation for traditional ML models means measuring accuracy on a held-out test set. You have labels, you make predictions, you compute a metric. Clean, quantitative, easy to track over time.

Agent evaluation is messier because agents are doing multi-step reasoning, making tool calls, and producing outputs that are often hard to compare to a single ground-truth answer. You're evaluating not just the final output but the path the agent took to get there.

There are roughly four things you want to measure:

Output quality - Is the final response correct, relevant, and complete?

Behavioral consistency - Does the agent behave the same way across semantically equivalent inputs? Does it follow your stated rules reliably?

Tool use accuracy - When the agent calls tools, is it calling the right tool with the right arguments? Is it calling tools when it should and not calling them when it shouldn't?

Trajectory efficiency - Is the agent taking reasonable paths to answers, or is it spinning through unnecessary tool calls and reasoning loops?

Different use cases weight these differently. A coding agent where "correct" is objectively verifiable cares a lot about output quality. A customer support agent where responses need to match policy cares more about behavioral consistency. A research agent cares about trajectory efficiency because wasted steps mean wasted money.
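Tool use accuracy and trajectory efficiency are the two that are easiest to leave vague, so here is a minimal sketch of how they might be scored against a hand-written reference trajectory. The `ToolCall` type, both function names, and the exact-match scoring rule are assumptions made for illustration; real suites often need fuzzier argument matching.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One tool invocation in a trajectory (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


def tool_use_accuracy(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Fraction of expected calls found in the actual trajectory with the same name and args."""
    if not expected:
        # No tool call was needed: any call at all counts as a miss.
        return 1.0 if not actual else 0.0
    remaining = list(actual)
    matched = 0
    for call in expected:
        if call in remaining:
            remaining.remove(call)  # each actual call can satisfy one expected call
            matched += 1
    return matched / len(expected)


def trajectory_efficiency(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Reference step count divided by steps actually taken, capped at 1.0."""
    if not actual:
        return 1.0 if not expected else 0.0
    return min(1.0, len(expected) / len(actual))


expected = [ToolCall("lookup_order", {"order_id": "12345"})]
actual = [
    ToolCall("search_docs", {"query": "shipping"}),
    ToolCall("lookup_order", {"order_id": "12345"}),
    ToolCall("lookup_order", {"order_id": "12345"}),
]
print(tool_use_accuracy(expected, actual))
print(trajectory_efficiency(expected, actual))
```

In this example the agent made the right call, so accuracy is perfect, but it took three steps where one would do, which is exactly the kind of waste the efficiency score is meant to surface.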

Promptfoo: the evaluation framework for teams who want control

Promptfoo started as a tool for testing prompts and has grown into a full agent evaluation framework. The core model is that you define test cases and assertions, run them against your agent, and get structured results you can track over time.

A basic Promptfoo config looks like this:

prompts:
  # Each test's user_message variable is substituted into the prompt
  - "{{user_message}}"

providers:
  - id: openai:gpt-4o
    config:
      temperature: 0.1

tests:
  - description: "Handles missing order number gracefully"
    vars:
      user_message: "where is my package"
    assert:
      - type: contains
        value: "order number"
      - type: not-contains
        value: "I cannot help"
      - type: llm-rubric
        value: "Response asks for clarification rather than refusing to help"

  - description: "Does not hallucinate shipping dates"
    vars:
      user_message: "When will my order 12345 arrive?"
    assert:
      - type: llm-rubric
        value: "Response doesn't state specific dates without evidence"

The llm-rubric assertion type is where this gets useful for agents. Rather than checking for exact strings, you describe the behavior you want in plain language and a grader LLM evaluates whether the response meets those criteria. This handles the fuzziness of natural language output.
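To make the idea concrete, here is a rough sketch of what a rubric grader does under the hood. This is not Promptfoo's implementation; the prompt template, the `grade_with_rubric` function, and the JSON verdict format are all assumptions for illustration. The `judge` argument is any function that sends a prompt to a model and returns its text, and `fake_judge` is a stand-in so the sketch runs offline.

```python
import json
from typing import Callable

# Hypothetical grading prompt; doubled braces are literal braces after .format().
GRADER_TEMPLATE = (
    "You are grading an AI assistant's response.\n"
    "Rubric: {rubric}\n"
    "Response: {response}\n"
    'Reply with JSON only: {{"pass": true or false, "reason": "<one sentence>"}}'
)


def grade_with_rubric(response: str, rubric: str, judge: Callable[[str], str]) -> dict:
    """Ask a grader model whether the response satisfies the rubric."""
    prompt = GRADER_TEMPLATE.format(rubric=rubric, response=response)
    raw = judge(prompt)
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        # Treat an unparseable grader reply as a failure instead of guessing.
        return {"pass": False, "reason": "grader returned unparseable output"}
    if not isinstance(verdict, dict):
        return {"pass": False, "reason": "grader returned an unexpected shape"}
    return {
        "pass": verdict.get("pass") is True,
        "reason": str(verdict.get("reason", "")),
    }


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call."""
    return '{"pass": true, "reason": "asks for the order number"}'


result = grade_with_rubric(
    "Could you share your order number?",
    "Response asks for clarification rather than refusing to help",
    fake_judge,
)
print(result)
```

Failing closed on unparseable grader output is a deliberate choice here: a grader that breaks should show up as a red test, not as a silent pass.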

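llm-rubric is one of several model-graded assertion types in Promptfoo. Others, such as factuality and answer-relevance, target specific properties, and you can write detailed rubrics that score the output on multiple dimensions at once.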

For agent evaluation specifically, Promptfoo supports multi-turn conversations. You can define a sequence of exchanges and assert on the state of the conversation at each step, which lets you test behaviors that only emerge over multiple turns.

Where it shines: Teams that want reproducible, version-controlled evals they can run in CI. Promptfoo's YAML-based config fits naturally into a Git workflow. You can see exactly which tests changed and what effect prompt changes had on scores.

Where it's limited: Promptfoo doesn't capture production traffic natively. You're writing test cases manually, which means you might miss failure modes that only show up in real user behavior.

Langfuse: observability first, evals second

Langfuse approaches this differently. It's primarily a tracing and observability platform, but it has built evals on top of that foundation. You instrument your agent to send traces to Langfuse, and then you run evaluators against those traces.

The key difference is that Langfuse operates on production data. You're not just testing synthetic cases; you're evaluating what actually happened in production conversations.

The tracing setup is lightweight:

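from langfuse import Langfuse
# Import path for the v2 Python SDK; newer SDK versions export observe from the top-level langfuse package.
from langfuse.decorators import observe

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST from the environment.
langfuse = Langfuse()

@observe()  # records each call to run_agent as a trace
def run_agent(user_message: str) -> str:
    # `agent` is your own agent object, defined elsewhere in your codebase
    response = agent.run(user_message)
    return response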

Once you have traces, you can run evaluators on them from the Langfuse dashboard or via API. Langfuse ships a set of default evaluators for common things (toxicity, hallucination, relevance) and lets you write custom ones.

The custom evaluator workflow is where this gets powerful. You can define evaluators that check for behaviors specific to your product. A financial agent might have evaluators that check whether responses include appropriate risk disclaimers. A coding agent might have evaluators that verify code blocks are syntactically valid before they're returned to users.
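The syntax check for a coding agent could look something like the sketch below. It only shows the scoring logic, using Python's standard `ast` module; how you register it as an evaluator depends on your Langfuse setup and is not shown. The function name and the choice to score as a fraction of valid blocks are assumptions.

```python
import ast
import re

# Built from single backticks so this example can itself sit inside a fenced block.
FENCE = "`" * 3
PYTHON_BLOCK = re.compile(FENCE + r"(?:python|py)\n(.*?)" + FENCE, re.DOTALL)


def python_blocks_valid(response_text: str) -> float:
    """Return the fraction of Python code blocks in the response that parse."""
    blocks = PYTHON_BLOCK.findall(response_text)
    if not blocks:
        return 1.0  # nothing to check, so nothing is broken
    valid = 0
    for source in blocks:
        try:
            ast.parse(source)  # parses only; the code is never executed
            valid += 1
        except (SyntaxError, ValueError):
            pass
    return valid / len(blocks)


good = f"Try this:\n{FENCE}python\nprint('hi')\n{FENCE}"
bad = f"Or this:\n{FENCE}python\ndef f(:\n    pass\n{FENCE}"
print(python_blocks_valid(good))
print(python_blocks_valid(good + "\n" + bad))
```

Parsing catches syntax errors only. Code that parses can still be wrong, so this works best as a cheap first filter alongside other evaluators.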

Langfuse also supports human annotation workflows. You can set up queues where human reviewers score a sample of production traces, and these scores feed into aggregate quality metrics over time. This gives you ground truth that pure LLM-graded evals can't provide.

Where it shines: Production monitoring. When you want to know not just whether your agent works in tests but whether it's working on real queries from real users, Langfuse is the natural choice.

Where it's limited: The local development experience is weaker than Promptfoo's. Running eval suites against new prompts before deploying requires a bit more setup.

Helicone: lightweight evals as a side effect of monitoring

Helicone started as a simple LLM proxy that logs all your API calls. The evals feature came later and reflects that lineage: it's designed to be low-friction and to give you useful signal without requiring a lot of configuration.

You route your OpenAI or Anthropic API calls through Helicone with a one-line change to your API base URL, and it starts logging everything automatically. From there, you can set up "scorers" that run asynchronously on your logged requests.

The scoring workflow is opinionated but practical. You define scoring functions in Python that take a request/response pair and return a score between 0 and 1:

def quality_scorer(request, response):
    """Score one logged request/response pair: 1.0 is acceptable, 0.0 is a known failure."""
    response_text = response["choices"][0]["message"]["content"]

    # request_has_documentation and contains_hallucinated_url are helpers you write yourself
    if "I don't know" in response_text and request_has_documentation(request):
        return 0.0  # Agent should have used available docs

    if contains_hallucinated_url(response_text):
        return 0.0

    return 1.0

Helicone runs these scorers on a sample of production traffic and surfaces the results in a dashboard. You can see quality scores over time, broken down by prompt version, model, user segment, or whatever dimensions matter to you.

Where it shines: Getting basic production quality monitoring running with minimal setup. If you're not doing any evaluation today, Helicone is probably the fastest way to get some signal.

Where it's limited: The eval capabilities are less sophisticated than dedicated eval frameworks. Complex multi-step assertions, detailed rubrics, and systematic test suite management are better handled by Promptfoo or Langfuse.

Building custom test suites: what you actually need

Every serious agent deployment eventually needs custom evaluation that goes beyond what off-the-shelf tools provide. Here's what a well-designed custom eval suite actually contains.

A golden set of labeled examples. These are input/output pairs where a human (ideally a domain expert) has labeled what a good response looks like. Start with 50-100 examples covering your most important and most common query types. This is the ground truth your automated evals are approximating.

Automated regression tests. Every time a real failure mode shows up in production, write a test case for it. The specific query that caused the problem, what the bad response looked like, and what a good response should look like. Over time, this becomes a regression suite that prevents you from shipping the same failure twice.

Adversarial test cases. Queries specifically designed to trigger failure modes: prompt injection attempts, edge cases that break your parsing logic, queries that are in scope but look out-of-scope, queries that are out of scope but look in-scope. These are hard to write but catch failures before users do.

Behavioral consistency checks. Take the same query and paraphrase it 10 different ways. The responses don't need to be identical but they should be consistent in their essential claims, tone, and approach. Large variance across paraphrases suggests your agent is brittle.

A/B eval infrastructure. When you change your prompts, you need a way to compare the old version against the new version on the same test set. This sounds obvious but a surprising number of teams make prompt changes without any systematic comparison.
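The A/B comparison does not need much machinery to start with. A minimal sketch, assuming each eval run has already been reduced to a mapping from test ID to pass/fail (the `compare_runs` name and the test IDs are made up for illustration):

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare two eval runs on the tests they have in common."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs share no test IDs")
    return {
        "baseline_pass_rate": sum(baseline[t] for t in shared) / len(shared),
        "candidate_pass_rate": sum(candidate[t] for t in shared) / len(shared),
        # Passed before, fails now: these are what should block a prompt change.
        "regressions": [t for t in shared if baseline[t] and not candidate[t]],
        # Failed before, passes now.
        "fixes": [t for t in shared if not baseline[t] and candidate[t]],
    }


baseline = {"missing-order-number": True, "no-invented-dates": False, "prompt-injection": True}
candidate = {"missing-order-number": True, "no-invented-dates": True, "prompt-injection": False}
print(compare_runs(baseline, candidate))
```

In this example both versions pass two of three tests, so the aggregate score is unchanged, yet the new prompt fixed one behavior and broke another. Comparing per test instead of per score is what makes that visible.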

The metrics that actually predict production quality

A lot of eval dashboards show you metrics that look impressive but don't correlate well with whether users actually like the agent. Here are the metrics worth tracking:

Task completion rate. On a defined set of tasks, what fraction does the agent successfully complete? For most agents this is the primary metric. Be honest about what counts as completion.

Refusal rate. How often does the agent refuse to help with in-scope queries? High refusal rates are a common failure mode with safety-tuned models. Users don't complain about it directly; they just stop using the product.

Factual accuracy on verifiable claims. For claims you can check (dates, URLs, prices, code that can be executed), what fraction are correct? Agents that are right 90% of the time feel unreliable; the 10% error rate destroys trust disproportionately to its frequency.

Latency at the 95th percentile. Mean latency is a bad metric for agent quality because the many simple, fast queries pull it down and hide the slow tail. The 95th percentile tells you about the slow paths, which are usually the complex multi-step tasks that matter most.

Cost per successful task. If your agent costs $0.04 per query on average but $0.40 per successful complex task, you need to know that. This affects both pricing and whether the unit economics of the feature make sense.
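Most of these metrics fall out of a simple per-task log. The sketch below assumes a hypothetical `TaskRun` record, treats every logged task as in scope, and uses the nearest-rank method for the percentile. Factual accuracy is left out because it needs claim-level checking that a flat log can't provide.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One evaluated task (hypothetical record shape)."""
    completed: bool
    refused: bool
    latency_s: float
    cost_usd: float


def summarize(runs: list[TaskRun]) -> dict:
    """Aggregate per-task records into the metrics discussed above."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    successes = sum(r.completed for r in runs)
    latencies = sorted(r.latency_s for r in runs)
    # Nearest-rank percentile, computed with integers to avoid float rounding.
    p95_rank = (95 * n + 99) // 100
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "task_completion_rate": successes / n,
        "refusal_rate": sum(r.refused for r in runs) / n,
        "mean_latency_s": sum(latencies) / n,
        "p95_latency_s": latencies[p95_rank - 1],
        "cost_per_query": total_cost / n,
        "cost_per_successful_task": total_cost / successes if successes else math.inf,
    }


runs = (
    [TaskRun(True, False, 1.0, 0.02)] * 8
    + [TaskRun(False, True, 0.5, 0.01)]
    + [TaskRun(True, False, 20.0, 0.30)]
)
for name, value in summarize(runs).items():
    print(f"{name}: {value:.3f}")
```

With this sample data the mean latency looks tolerable while the 95th percentile is the 20-second task, and cost per successful task comes out higher than cost per query because the failed run still cost money.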

How to run evals in CI without breaking your deploy pipeline

The practical question is how to integrate evaluation into your development workflow so it actually gets run.

The most effective pattern is a tiered eval setup:

Pre-commit (fast): A small set of 20-30 tests covering the most critical behaviors. These should run in under 2 minutes. If they fail, the commit doesn't go through.

Pre-deploy (thorough): Your full test suite, run against a staging environment. These can take 15-30 minutes and evaluate comprehensively. A deployment requires passing this gate.

Production monitoring (continuous): Lightweight scorers running on live traffic, alerting on quality drops. This is your early warning system for issues that only appear at scale.

The mistake most teams make is trying to run the full eval suite pre-commit. It's too slow, and developers start finding ways to skip it. Keep the fast gate fast, and run the thorough suite where it doesn't block developer iteration.
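One way to keep the tiers honest is to tag every test with the gate it belongs to and let a single runner select by gate. A minimal sketch follows; the `EvalCase` type, the tier names, and the `run_gate` function are assumptions, and the lambdas stand in for real agent checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    tier: str  # "fast" runs in every gate, "full" only in the pre-deploy gate
    check: Callable[[], bool]


def run_gate(
    cases: list[EvalCase], gate: str, min_pass_rate: float = 1.0
) -> tuple[bool, list[str]]:
    """Run the cases belonging to a gate; return (gate passed, names of failed cases)."""
    if gate not in ("fast", "full"):
        raise ValueError(f"unknown gate: {gate}")
    selected = [c for c in cases if gate == "full" or c.tier == "fast"]
    if not selected:
        return False, []  # an empty gate should not pass silently
    failures = [c.name for c in selected if not c.check()]
    passed = len(selected) - len(failures)
    return passed >= min_pass_rate * len(selected), failures


cases = [
    EvalCase("asks-for-order-number", "fast", lambda: True),
    EvalCase("long-multi-turn-refund", "full", lambda: False),
]
print(run_gate(cases, "fast"))
print(run_gate(cases, "full"))
```

In a pre-commit hook or CI job, you would exit non-zero when the first element of the result is false. The `min_pass_rate` knob lets the thorough gate tolerate a little flakiness from LLM-graded checks while the fast gate stays strict.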

Evaluation isn't a one-time thing you set up and forget. The queries your users actually ask will always surprise you, and your eval suite needs to evolve with them. Block out time every two weeks to review production failures and add them to your test set. Teams that do this systematically end up with agents that improve continuously. Teams that skip it end up with agents that feel good in demos and disappoint in production.