AI Agent Observability Stack 2026: Langfuse vs LangSmith vs Helicone vs Phoenix

April 10, 2026 · Editorial Team · 7 min read · ai-infrastructure observability llm-ops

When your agent starts doing something wrong, you need to know what happened and when. Not a vague sense that outputs are degrading. Not a webhook that fired after 200 users already hit the broken path. You need to see the full trace: which prompt ran, which model version responded, how many tokens that consumed, what the latency was at each hop, and where exactly things went sideways.

That's what observability gives you. And in 2026, the tooling has matured to the point where you have real choices, each with genuine tradeoffs.

This piece compares the four tools that come up most in serious agent deployments: Langfuse, LangSmith, Helicone, and Arize Phoenix. These aren't the only options, but they're the ones worth evaluating first.

What you actually need from an observability platform

Before comparing tools, let's be clear about what LLM observability actually means. It's not the same as standard application monitoring, even though it overlaps with it.

You need span-level tracing of multi-step agent workflows. A single user interaction might involve a planner LLM call, three tool executions, two more LLM calls to process tool outputs, and a final synthesis call. You want to see all of that as a connected trace, not five isolated log lines.
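
To make "connected trace" concrete, here is a minimal sketch of the data shape in plain Python. The Span class and its field names are illustrative assumptions, not the schema of any of the four tools; the point is only that every step hangs off a parent so the whole interaction can be rendered and totaled as one unit.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """One step in an agent run (hypothetical schema, for illustration only)."""
    name: str
    kind: str          # e.g. "agent", "llm", "tool"
    latency_ms: float
    tokens: int = 0
    children: list[Span] = field(default_factory=list)

    def child(self, name: str, kind: str, latency_ms: float, tokens: int = 0) -> Span:
        """Attach a child span and return it so callers can nest further."""
        span = Span(name, kind, latency_ms, tokens)
        self.children.append(span)
        return span

    def total_tokens(self) -> int:
        """Tokens used by this span plus everything beneath it."""
        return self.tokens + sum(c.total_tokens() for c in self.children)

    def render(self, depth: int = 0) -> list[str]:
        """Indented one-line-per-span view of the trace tree."""
        line = f"{'  ' * depth}{self.name} [{self.kind}] {self.latency_ms:.0f}ms {self.tokens} tok"
        lines = [line]
        for c in self.children:
            lines.extend(c.render(depth + 1))
        return lines


# The interaction described above: planner, three tools, two LLM calls, synthesis.
root = Span("handle_user_request", "agent", latency_ms=4200)
root.child("planner", "llm", 800, tokens=1200)
for tool in ("search_docs", "fetch_order", "check_inventory"):
    root.child(tool, "tool", 300)
root.child("process_tool_output_1", "llm", 600, tokens=900)
root.child("process_tool_output_2", "llm", 650, tokens=950)
root.child("synthesize_answer", "llm", 900, tokens=1500)

print("\n".join(root.render()))
print("total tokens:", root.total_tokens())
```

Without the parent link, the same run would show up as seven unrelated log lines with no way to total them per interaction.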

You need token tracking with cost attribution. Not just "this call used X tokens" but "this user's session cost $0.43 and they've had 12 sessions this month." That means per-user, per-trace cost rollups, not just raw token counts.
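
As a rough sketch of what a cost rollup involves, the snippet below converts token counts to dollars and aggregates by user and by session. The model name and per-token prices are made-up placeholders, and the record fields are assumptions; real platforms maintain their own price tables.

```python
from collections import defaultdict

# Hypothetical USD prices per 1M tokens. Substitute your provider's real rates.
PRICE_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one LLM call under the placeholder price table."""
    price = PRICE_PER_MTOK[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def rollup(calls: list[dict]) -> tuple[dict, dict]:
    """Aggregate call costs per user and per (user, session)."""
    by_user: dict = defaultdict(float)
    by_session: dict = defaultdict(float)
    for c in calls:
        cost = call_cost(c["model"], c["input_tokens"], c["output_tokens"])
        by_user[c["user_id"]] += cost
        by_session[(c["user_id"], c["session_id"])] += cost
    return dict(by_user), dict(by_session)


calls = [
    {"user_id": "u1", "session_id": "s1", "model": "example-model",
     "input_tokens": 100_000, "output_tokens": 8_000},
    {"user_id": "u1", "session_id": "s2", "model": "example-model",
     "input_tokens": 10_000, "output_tokens": 1_000},
    {"user_id": "u2", "session_id": "s3", "model": "example-model",
     "input_tokens": 50_000, "output_tokens": 2_000},
]

by_user, by_session = rollup(calls)
for user, cost in sorted(by_user.items()):
    print(f"{user}: ${cost:.3f}")
```

The rollup only works if every call carries a user and session identifier, which is why the tagging advice later in this piece matters.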

You need evaluation hooks. The ability to score outputs, either with human feedback, automated LLM-as-judge, or both, and to track how quality changes over time as you update prompts or switch models.

You need prompt management integration. When you change a prompt and quality drops, you want to know. When you A/B test two prompt versions, you want statistical significance, not a gut feeling.
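
For a sense of what "statistical significance" means here, the following is a minimal sketch of one common approach: a two-proportion z-test on a binary quality signal such as thumbs-up rate. The counts are invented for illustration, and this is only one of several reasonable tests; it is not a description of how any of these platforms computes significance.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in success rates B minus A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both variants are all-success or all-failure: no detectable difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal, expressed via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented example: prompt A got 410/500 positive ratings, prompt B got 440/500.
z, p = two_proportion_z_test(410, 500, 440, 500)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With small sample sizes the same six-point gap would not clear a conventional 0.05 threshold, which is the practical reason to wait for enough traffic before declaring a winner.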

Some of these needs are better served by specific tools. Here's how each one stacks up.

Langfuse

Langfuse started as an open-source tracing project and has grown into a full observability and prompt management platform. It's probably the most widely deployed LLM observability tool right now for teams that want self-hosting as an option.

What it does well: The tracing model is excellent. You get spans, generations, events, and scores as first-class entities, and the SDK makes it easy to instrument both direct API calls and framework-based chains. LangChain, LlamaIndex, OpenAI SDK, Anthropic SDK, and custom setups all work. The web UI lets you drill from a session down to individual token usage in a single click.

Prompt management is a genuine standout feature. You version your prompts in Langfuse, fetch them at runtime (so deploys don't require code changes to update a prompt), and every generation gets linked back to the exact prompt version that ran. When something breaks, you know which version of the prompt was live.
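
The pattern is easier to see in miniature. The sketch below uses a hypothetical in-memory PromptRegistry as a stand-in for a hosted prompt store; it does not use the Langfuse SDK, and all class and function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: int
    template: str


class PromptRegistry:
    """Hypothetical stand-in for a hosted prompt store."""

    def __init__(self) -> None:
        self._versions: dict = {}  # prompt name -> list of PromptVersion

    def publish(self, name: str, template: str) -> PromptVersion:
        """Store a new version; version numbers start at 1."""
        versions = self._versions.setdefault(name, [])
        prompt = PromptVersion(name, len(versions) + 1, template)
        versions.append(prompt)
        return prompt

    def get(self, name: str, version: Optional[int] = None) -> PromptVersion:
        """Fetch a specific version, or the latest when none is given."""
        versions = self._versions[name]
        return versions[-1] if version is None else versions[version - 1]


def generate(registry: PromptRegistry, name: str, **variables) -> dict:
    """Fetch the live prompt at call time and record which version was used."""
    prompt = registry.get(name)
    rendered = prompt.template.format(**variables)
    # A real implementation would send `rendered` to a model here.
    return {"prompt_name": prompt.name, "prompt_version": prompt.version, "rendered": rendered}


registry = PromptRegistry()
registry.publish("support_reply", "Answer the customer: {question}")
first = generate(registry, "support_reply", question="Where is my order?")

# Publishing a new version changes behavior with no change to generate().
registry.publish("support_reply", "Answer briefly and politely: {question}")
second = generate(registry, "support_reply", question="Where is my order?")
print(first["prompt_version"], second["prompt_version"])
```

Because each result records its prompt version, a quality drop can be traced to the publish that caused it.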

Pricing: Langfuse Cloud has a free tier with 50,000 observations per month. The Pro plan is $59/month per project and bumps you to higher limits. Enterprise pricing is negotiated. Self-hosted Langfuse is MIT licensed, so you can run it on your own infrastructure at no licensing cost, though you pay for the infrastructure itself.

Where it falls short: The evaluation UX is functional but not great. Setting up automated evals requires more configuration than it should. The dataset and evaluation features work, but they feel like they were added on top of a tracing-first system rather than designed from the ground up.

LangSmith

LangSmith is LangChain's observability product. If you're already building with LangChain or LangGraph, the integration is almost zero-friction. You set an environment variable, and traces start appearing.
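
In practice that usually means a tracing flag plus an API key. A minimal sketch, assuming the variable names in LangSmith's current documentation (older releases read LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY instead); the key and project values are placeholders, and these are more commonly set in the shell or a deployment config than in code.

```python
import os

# Set these before any LangChain or LangGraph code runs.
os.environ["LANGSMITH_TRACING"] = "true"
os.environ.setdefault("LANGSMITH_API_KEY", "<your-api-key>")  # placeholder, not a real key
os.environ.setdefault("LANGSMITH_PROJECT", "my-agent-dev")    # hypothetical project name

print("tracing enabled:", os.environ["LANGSMITH_TRACING"])
```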

What it does well: The LangChain ecosystem integration is unmatched. LangGraph traces in particular are beautifully rendered: you can see the full agent graph execution with each node's inputs and outputs. For teams invested in the LangChain ecosystem, this is genuinely compelling.

The playground feature lets you load any traced run and replay it with modified inputs or system prompts. That's useful for debugging: you find a bad output, click into it, tweak the prompt, and rerun immediately without touching code.

Dataset management and eval pipelines are first-class. LangSmith was built with evaluation in mind, and it shows. You can create datasets from production traces, run automated evals against them, and track metrics over time as you ship changes.

Pricing: LangSmith has a Developer plan at $39/month for up to 25,000 traces per month. The Plus plan is $99/month per seat with higher limits. Enterprise pricing is separate. There's no self-hosted option outside of LangChain's enterprise agreements.

Where it falls short: The ecosystem lock-in cuts both ways. If you're not using LangChain, the integration is more manual than with Langfuse. The per-seat pricing model also gets expensive fast for larger teams.

Helicone

Helicone takes a fundamentally different approach. Instead of an SDK you instrument your code with, it's a proxy. You route your OpenAI (or Anthropic, or other provider) calls through Helicone's URL instead of the provider's URL directly, and it logs everything automatically with no code changes.

What it does well: The zero-instrumentation setup is the whole value proposition. If you have an existing codebase and want observability now, without a refactor, Helicone is genuinely the fastest way to get there. Change one line (the base URL), and you have logs, costs, latency data, and basic analytics.
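
Here is roughly what that change looks like with the OpenAI Python SDK. This sketch assumes Helicone's documented OpenAI gateway URL and its Helicone-Auth header, so in practice it is a base URL plus one auth header; check the current docs before relying on either. The helper name is made up for illustration.

```python
import os

# Helicone's OpenAI-compatible gateway, as documented at the time of writing.
HELICONE_OPENAI_BASE_URL = "https://oai.helicone.ai/v1"


def helicone_client_kwargs(helicone_api_key: str) -> dict:
    """Keyword arguments that point an OpenAI client at the Helicone proxy."""
    return {
        "base_url": HELICONE_OPENAI_BASE_URL,
        "default_headers": {"Helicone-Auth": f"Bearer {helicone_api_key}"},
    }


kwargs = helicone_client_kwargs(os.environ.get("HELICONE_API_KEY", "<helicone-key>"))
print(kwargs["base_url"])

# With the openai package installed, the rest of the application is unchanged:
#   from openai import OpenAI
#   client = OpenAI(**kwargs)
```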

The cost tracking UI is polished. Helicone's dashboards show cost over time, cost by model, cost by user, and cost by custom property (which you add via headers). For finance-first teams that need to show LLM costs in budget reviews, this is the cleanest output.
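
Custom properties are plain request headers. The sketch below builds them using Helicone's documented Helicone-User-Id and Helicone-Property-<Name> conventions; the helper function and the property names are illustrative assumptions.

```python
def helicone_request_headers(user_id: str, **properties: str) -> dict:
    """Build per-request headers that tag a call with a user and custom properties."""
    headers = {"Helicone-User-Id": user_id}
    for key, value in properties.items():
        # "feature_name" becomes "Helicone-Property-Feature-Name"
        label = "-".join(part.capitalize() for part in key.split("_"))
        headers[f"Helicone-Property-{label}"] = str(value)
    return headers


headers = helicone_request_headers("user-123", feature="support_chat", environment="prod")
for name, value in headers.items():
    print(f"{name}: {value}")

# With the OpenAI SDK these can be passed per call, for example:
#   client.chat.completions.create(model=..., messages=..., extra_headers=headers)
```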

Helicone also has a caching feature built in, which is something the other tools don't do. Frequent identical prompts get cached at the proxy layer, which can meaningfully cut costs for certain use cases.

Pricing: Free tier covers up to 100,000 requests per month. The Pro plan is $20/month for 1 million requests. Teams plan is $200/month with more seats and features. Helicone also offers a self-hosted option.

Where it falls short: The proxy architecture is a latency tradeoff. You're adding a network hop. Helicone publishes p99 latency overhead numbers (typically 30-50ms), but for latency-sensitive applications, that's real. Also, the tracing depth for complex multi-step agents doesn't match what Langfuse or LangSmith offer. If you have a 12-step agent workflow, Helicone shows you 12 logged requests, not a unified trace with parent-child relationships.

Arize Phoenix

Phoenix is the open-source product from Arize AI, the MLOps observability company. It's designed around the concept of datasets, experiments, and evaluations rather than just traces. The framing is more "eval-first" than "trace-first."

What it does well: Phoenix's evaluation capabilities are the strongest of the four. It ships with a library of pre-built eval templates for things like RAG relevance, hallucination detection, toxicity, and response quality. The experiment interface lets you run an eval against a dataset, compare two versions side by side, and get statistical summaries of improvement or regression.

For teams doing active prompt engineering and A/B testing, Phoenix's experiment paradigm feels more natural than the trace-centric UI of the other tools.
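
The experiment paradigm boils down to: run each variant over the same dataset, score every output, then compare. The sketch below shows that loop in plain Python with a trivial substring evaluator and two stub "variants" in place of real model calls. None of it is the Phoenix API; the function names and dataset are assumptions for illustration.

```python
from statistics import mean


def contains_expected(output: str, expected: str) -> float:
    """Toy evaluator: 1.0 if the expected answer appears in the output, else 0.0."""
    return 1.0 if expected.lower() in output.lower() else 0.0


def run_experiment(task, dataset, evaluator) -> list:
    """Run one variant over every row and return its per-row scores."""
    return [evaluator(task(row["input"]), row["expected"]) for row in dataset]


def compare(scores_a: list, scores_b: list) -> dict:
    """Side-by-side summary of two runs over the same dataset."""
    pairs = list(zip(scores_a, scores_b))
    return {
        "mean_a": mean(scores_a),
        "mean_b": mean(scores_b),
        "improved": sum(b > a for a, b in pairs),
        "regressed": sum(b < a for a, b in pairs),
    }


dataset = [
    {"input": "capital of France", "expected": "Paris"},
    {"input": "capital of Japan", "expected": "Tokyo"},
    {"input": "capital of Kenya", "expected": "Nairobi"},
    {"input": "capital of Peru", "expected": "Lima"},
]

# Stub variants standing in for two prompt versions calling a model.
ANSWERS_A = {"capital of France": "Paris", "capital of Japan": "Tokyo"}
ANSWERS_B = {**ANSWERS_A, "capital of Kenya": "Nairobi", "capital of Peru": "Cusco"}


def variant_a(question: str) -> str:
    return ANSWERS_A.get(question, "I don't know")


def variant_b(question: str) -> str:
    return ANSWERS_B.get(question, "I don't know")


summary = compare(
    run_experiment(variant_a, dataset, contains_expected),
    run_experiment(variant_b, dataset, contains_expected),
)
print(summary)
```

Tracking improved and regressed rows separately matters because a variant can raise the mean while still breaking cases that used to pass.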

The OpenInference conventions that Phoenix uses for spans build on OpenTelemetry and are gaining adoption beyond Arize, which means you get interoperability with other tools that support them.

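Pricing: Phoenix is free to self-host. Its source is public, though the repository is published under the Elastic License 2.0 rather than Apache 2.0, so review the terms if you plan to offer it as a managed service. Arize's hosted offering starts with a free tier for limited scale and moves to enterprise pricing at higher volumes.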

Where it falls short: Self-hosting Phoenix requires more infrastructure work than self-hosting Langfuse. The tracing UI is less polished than LangSmith's for complex agent workflows. If your team isn't already doing formal eval pipelines, Phoenix's eval-centric model can feel like more overhead than you need.

How to choose

Here's an honest decision tree.

If you're using LangChain or LangGraph, start with LangSmith. The integration is trivially easy and the LangGraph visualization alone is worth the setup.

If you want self-hosting and full data ownership, Langfuse is the right call. The self-hosted version is production-quality, and the prompt management features are good enough to make it your prompt registry too.

If you have an existing codebase and just need logs and cost tracking fast, Helicone's proxy approach will get you up and running in 30 minutes without touching your application logic.

If your team is evaluation-heavy and already thinks in terms of datasets and experiments, Phoenix's model fits how you work.

Many teams end up running two: Helicone at the proxy layer for cost tracking and basic analytics, plus Langfuse or LangSmith for deep tracing on the critical paths. That's not wasteful. The two layers capture different data at different levels.

The pieces most teams forget

Observability data is only useful if you actually look at it. The tools above all give you dashboards, but you should also set up alerts: cost per day exceeding a threshold, p95 latency above your SLA, error rate on specific agent paths exceeding a baseline. These alerts exist in all four platforms, and almost nobody configures them.
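
The three checks are simple enough to sketch in a few lines. The threshold values and function names below are assumptions chosen for illustration; in a real deployment you would configure the equivalent rules in your platform or feed its exported metrics into something like this.

```python
def p95(values: list) -> float:
    """Nearest-rank 95th percentile; integer math avoids float rounding surprises."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def check_alerts(daily_cost_usd, latencies_ms, error_count, request_count, thresholds) -> list:
    """Return a human-readable message for each threshold that was breached."""
    alerts = []
    if daily_cost_usd > thresholds["max_daily_cost_usd"]:
        alerts.append(
            f"daily cost ${daily_cost_usd:.2f} exceeds ${thresholds['max_daily_cost_usd']:.2f}"
        )
    if latencies_ms and p95(latencies_ms) > thresholds["max_p95_latency_ms"]:
        alerts.append(
            f"p95 latency {p95(latencies_ms)}ms exceeds {thresholds['max_p95_latency_ms']}ms"
        )
    if request_count and error_count / request_count > thresholds["max_error_rate"]:
        alerts.append(
            f"error rate {error_count / request_count:.1%} exceeds {thresholds['max_error_rate']:.1%}"
        )
    return alerts


# Placeholder thresholds; pick values that match your budget and SLA.
thresholds = {"max_daily_cost_usd": 50.0, "max_p95_latency_ms": 2000, "max_error_rate": 0.02}
latencies = [300] * 18 + [2500, 4000]

alerts = check_alerts(62.5, latencies, error_count=3, request_count=200, thresholds=thresholds)
for message in alerts:
    print("ALERT:", message)
```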

The other piece is tagging. All these tools let you attach metadata to traces: user ID, session ID, environment, feature name, experiment variant. Teams that instrument this from day one have dramatically more useful data than teams that add it later. It's five extra lines of code, and six months from now you'll be grateful you did it.
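
One way to keep tagging consistent is to build the metadata in a single helper that refuses to proceed when a tag is missing. This is a hypothetical sketch: the required tag names mirror the list above, and how the resulting dict gets attached to a trace depends on the SDK you use.

```python
REQUIRED_TAGS = ("user_id", "session_id", "environment", "feature", "experiment_variant")


def trace_metadata(**tags) -> dict:
    """Validate and normalize the tags attached to every trace."""
    missing = [key for key in REQUIRED_TAGS if key not in tags]
    if missing:
        raise ValueError(f"missing trace tags: {', '.join(missing)}")
    return {key: str(value) for key, value in tags.items()}


metadata = trace_metadata(
    user_id="user-123",
    session_id="sess-456",
    environment="prod",
    feature="support_chat",
    experiment_variant="prompt_v2",
)
print(metadata)
```

Failing loudly on a missing tag is a deliberate choice here: untagged traces are hard to backfill, so it is cheaper to catch the gap at instrumentation time.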

For deeper coverage of what to put on your dashboards once you have a tracing platform in place, the AI agent monitoring dashboards guide covers the specific metrics and alert thresholds that matter most.