Langfuse
Open-source LLM observability, evaluation, and prompt management for agent debugging and cost tracking
Langfuse is an open-source platform for LLM observability, evaluation, and prompt management. It traces agent runs at the span level, tracks token costs across providers, manages prompt versions with staging environments, and runs evaluations using LLM-as-judge or custom scorers. Available as a self-hosted MIT-licensed installation or a managed cloud service with a free hobby tier.
You ship an agent, it goes live, and users start reporting weird outputs. You open your logs. You see the final answer. You have no idea how the agent got there. What did the retrieval step return? What did the model see in the context window? Which tool call failed silently? Which prompt variant was active at the time?
This is the observability gap that Langfuse fills. It's not a framework for building agents. It's the layer that makes debugging, measuring, and improving agents possible once they're running.
What Langfuse is
Langfuse is an open-source observability, evaluation, and prompt management platform for LLM applications. It was founded in 2023 and released under the MIT license, and it has reached over 13,000 GitHub stars at langfuse/langfuse. The platform is available as a managed cloud service and as a self-hosted deployment. Both offer the same feature set; the difference is who manages the infrastructure.
The core product covers four areas that every production LLM application eventually needs:
- Distributed tracing for recording exactly what happened during an agent run
- Cost and token tracking for understanding what you're spending and where
- Prompt management for versioning and deploying prompt templates without code deploys
- Evaluation for measuring output quality at scale, not just during manual testing
These four things are related. You can't run good evaluations without good traces. You can't manage prompt versions meaningfully if you can't see which version produced which outputs. Langfuse bundles them because teams that need one usually need all of them.
Tracing: span-level visibility
The tracing model in Langfuse follows the OpenTelemetry mental model: a trace represents one complete agent run or request, composed of spans for each step. Spans are nested: a top-level trace might have spans for retrieval, reranking, and generation, and the generation span might have sub-spans for each tool call the LLM made.
Every span records the following (the model, token, and cost fields apply to spans that wrap an LLM call):
- Input and output at that step
- Timing (start, end, latency)
- Model used and provider
- Token counts (prompt, completion, total)
- Calculated cost based on Langfuse's provider pricing tables
- Any metadata, tags, or custom attributes you add
The result is a complete picture of what happened in a run. When an agent produces a wrong answer, you can trace exactly which retrieval result contained the bad information, which prompt the model saw, and which tool call returned a value the model misinterpreted. This is the difference between a 5-minute debug session and a 2-hour one.
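The nesting described above can be pictured as a small tree. The sketch below is a stdlib-only illustration, not Langfuse's data model: the `Span` class, its field names, and the sample numbers are all hypothetical, chosen to mirror the attributes listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical span record; field names are illustrative only."""
    name: str
    latency_ms: float
    total_tokens: int = 0
    cost_usd: float = 0.0
    children: list["Span"] = field(default_factory=list)

    def walk(self, depth: int = 0):
        """Yield (depth, span) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)

    def total_cost(self) -> float:
        """Sum cost over this span and all of its descendants."""
        return sum(span.cost_usd for _, span in self.walk())


# One trace for a RAG request: retrieval, reranking, then generation with a tool call.
trace = Span("rag-request", latency_ms=1840.0, children=[
    Span("retrieval", latency_ms=120.0),
    Span("reranking", latency_ms=85.0),
    Span("generation", latency_ms=1600.0, total_tokens=950, cost_usd=0.004, children=[
        Span("tool:search", latency_ms=400.0),
    ]),
])

# Print the tree with indentation showing the nesting depth.
for depth, span in trace.walk():
    print("  " * depth + f"{span.name} ({span.latency_ms:.0f} ms)")
```

Walking the tree depth-first is what lets you start from a bad final answer and step down to the retrieval result or tool call that caused it.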
Adding tracing to your application
Langfuse supports three integration paths:
SDK-based instrumentation using the Python or TypeScript SDK:
    # Import paths follow the v2 Python SDK; later SDK versions changed them.
    from langfuse import Langfuse
    from langfuse.decorators import observe, langfuse_context

    langfuse = Langfuse()  # reads keys from LANGFUSE_* environment variables


    @observe()
    def retrieve_documents(query: str) -> list[str]:
        # your retrieval logic
        return ["doc1 content", "doc2 content"]


    @observe()
    def generate_answer(context: list[str], question: str) -> str:
        # Attach extra detail to the span created for this call
        langfuse_context.update_current_observation(
            input={"question": question},
            metadata={"context_length": len(context)},
        )
        # your LLM call
        return "the answer"


    @observe(name="rag-pipeline")
    def answer_question(question: str) -> str:
        docs = retrieve_documents(question)
        return generate_answer(docs, question)
The @observe() decorator creates a span automatically. Calling one decorated function from another creates nested spans. You call langfuse_context.update_current_observation() to add metadata or custom input/output overrides to the current span; scores are attached separately.
Native framework integrations where you configure a Langfuse callback handler and the framework instruments itself:
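    # LangChain handler (v2 SDK import path)
    from langfuse.callback import CallbackHandler

    langfuse_handler = CallbackHandler(
        public_key="your-public-key",
        secret_key="your-secret-key",
    )

    # LangChain: `chain` is any runnable you have already built
    chain.invoke({"input": "your question"}, config={"callbacks": [langfuse_handler]})

    # LlamaIndex: uses its own handler class rather than the LangChain one
    from llama_index.core import Settings
    from langfuse.llama_index import LlamaIndexCallbackHandler

    Settings.callback_manager.add_handler(LlamaIndexCallbackHandler())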
OpenTelemetry ingestion for systems that already emit OTEL spans. Langfuse exposes an OTEL-compatible endpoint, so any framework or service that uses OTEL tracing can send data to Langfuse without the native SDK. This matters for polyglot systems where not everything runs Python or TypeScript.
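For the OTEL path, most exporters are configured through the standard `OTEL_EXPORTER_OTLP_*` environment variables. The sketch below builds those values for a Langfuse-style endpoint. Treat the `/api/public/otel` path and the Basic-auth scheme built from the project key pair as assumptions to check against the Langfuse docs for your version; the host and keys are placeholders, and `otel_env_for_langfuse` is a hypothetical helper.

```python
import base64


def otel_env_for_langfuse(host: str, public_key: str, secret_key: str) -> dict[str, str]:
    """Build OTLP exporter environment variables for a Langfuse-style endpoint.

    Assumption: the ingestion path is `/api/public/otel` and the endpoint
    accepts HTTP Basic auth built from the project key pair.
    """
    credentials = f"{public_key}:{secret_key}".encode("utf-8")
    auth = base64.b64encode(credentials).decode("ascii")
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"{host.rstrip('/')}/api/public/otel",
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Basic {auth}",
    }


# Placeholder host and keys; substitute your own deployment's values.
env = otel_env_for_langfuse("https://langfuse.example.com", "pk-lf-demo", "sk-lf-demo")
for key, value in env.items():
    print(f"{key}={value}")
```

Some OTEL SDKs expect the header value to be URL-encoded, so the space after `Basic` may need to be written as `%20` depending on the exporter.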
Cost tracking
Every span that involves an LLM call gets a cost estimate based on token counts and Langfuse's provider pricing tables. The tables cover OpenAI, Anthropic, Cohere, Mistral, Google, and many other providers. You can also add custom model definitions with your own per-token rates, which matters if you're running inference through a custom endpoint or a provider Langfuse doesn't list by default.
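The arithmetic behind a per-span cost estimate is simple: token counts multiplied by per-token rates from a pricing table. A minimal sketch follows; the model names and prices are invented for illustration, and `ModelPrice` and `span_cost` are hypothetical names rather than Langfuse APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Prices in USD per 1,000 tokens. All figures below are made up."""
    input_per_1k: float
    output_per_1k: float


# Hypothetical pricing table; real rates come from each provider's price list.
PRICES = {
    "example-large": ModelPrice(input_per_1k=0.0050, output_per_1k=0.0150),
    "example-small": ModelPrice(input_per_1k=0.0005, output_per_1k=0.0015),
}

# A custom model definition, e.g. for a self-hosted inference endpoint.
PRICES["my-finetune"] = ModelPrice(input_per_1k=0.0010, output_per_1k=0.0020)


def span_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return the estimated cost in USD for one LLM call."""
    price = PRICES[model]
    input_cost = (prompt_tokens / 1000) * price.input_per_1k
    output_cost = (completion_tokens / 1000) * price.output_per_1k
    return input_cost + output_cost


print(f"{span_cost('example-large', 1200, 300):.4f}")
```

Because input and output tokens are usually priced differently, the table needs both rates; a single blended rate would misestimate calls with long prompts and short completions.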
The cost data is aggregated across projects, models, users, and time periods in the dashboard. You can answer questions like:
- What percentage of our total cost comes from GPT-4o versus Claude Haiku?
- Which users are driving the most expensive sessions?
- Did our cost per query increase after we switched to a different retrieval strategy?
- What's our average cost per successful task completion?
For teams building on top of paid LLM APIs, this visibility is genuinely useful for budget planning and for making the case that switching models is worth the quality tradeoff.
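The dashboard questions above reduce to group-by sums over per-trace cost. The sketch below works on a hypothetical export of trace rows; the field names and values are assumptions for illustration, not Langfuse's export schema.

```python
from collections import defaultdict

# Hypothetical exported rows: one per trace, with cost already computed.
traces = [
    {"user": "u1", "model": "model-a", "cost_usd": 0.030, "success": True},
    {"user": "u1", "model": "model-b", "cost_usd": 0.002, "success": True},
    {"user": "u2", "model": "model-a", "cost_usd": 0.045, "success": False},
    {"user": "u2", "model": "model-b", "cost_usd": 0.003, "success": True},
]


def cost_by(rows, key):
    """Sum cost per distinct value of `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["cost_usd"]
    return dict(totals)


total = sum(row["cost_usd"] for row in traces)
share_by_model = {m: c / total for m, c in cost_by(traces, "model").items()}

# Failed runs still cost money, so divide total spend by successful runs only.
successes = sum(1 for row in traces if row["success"])
cost_per_success = total / successes

print({m: f"{s:.2%}" for m, s in share_by_model.items()})
print(cost_by(traces, "user"))
print(f"cost per successful task: ${cost_per_success:.4f}")
```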
Prompt management
Prompt management is the feature that gets underestimated until you've felt the pain of the alternative. Without a prompt registry, your prompts live in code. Updating a prompt requires a code change, a PR, a review, and a deploy. If you want to test different prompt variants, you're managing that in code too. If something goes wrong in production, you can't roll back just the prompt without rolling back the code.
Langfuse's prompt management decouples prompts from code:
    from langfuse import Langfuse

    langfuse = Langfuse()

    # Fetch the production-labeled version of a prompt
    prompt = langfuse.get_prompt("rag-system-prompt")

    # Use the template with variable substitution
    system_message = prompt.compile(context="relevant docs", user_name="Alice")
The prompt template lives in Langfuse. You label a version as "production" in the UI. Your application fetches that label at runtime. To update the prompt, you edit the template in Langfuse and relabel the new version as production. No code change. No deploy. Rollback is relabeling the previous version.
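The label mechanism is easiest to understand as a pointer from a label to a version number. The toy registry below is a hypothetical in-memory model of that idea, not Langfuse's storage or API; it also uses Python's `str.format` placeholders purely for brevity.

```python
class PromptRegistry:
    """Toy in-memory registry; a hypothetical model of label-based deployment."""

    def __init__(self):
        self._versions = {}  # name -> list of templates; index + 1 is the version
        self._labels = {}    # (name, label) -> version number

    def create_version(self, name, template):
        """Append a new immutable version and return its number."""
        versions = self._versions.setdefault(name, [])
        versions.append(template)
        return len(versions)

    def set_label(self, name, label, version):
        """Point a label at an existing version."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise ValueError(f"{name} has no version {version}")
        self._labels[(name, label)] = version

    def get(self, name, label="production"):
        """Return (version, template) for whatever the label points at."""
        version = self._labels[(name, label)]
        return version, self._versions[name][version - 1]


registry = PromptRegistry()
v1 = registry.create_version("rag-system-prompt", "Answer using: {context}")
v2 = registry.create_version("rag-system-prompt", "Hi {user_name}. Answer using: {context}")

registry.set_label("rag-system-prompt", "production", v2)  # deploy v2
deployed = registry.get("rag-system-prompt")
registry.set_label("rag-system-prompt", "production", v1)  # roll back by relabeling
rolled_back = registry.get("rag-system-prompt")

print(deployed[0], rolled_back[0])
print(rolled_back[1].format(context="relevant docs"))
```

Versions are never edited in place, which is what makes rollback a pointer move instead of a content restore.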
Langfuse also links prompt versions to the traces that used them, so you can query "what outputs did version 12 of this prompt produce" and compare quality across versions. This is the data that A/B testing prompt variants actually needs.
Evaluation
Langfuse's evaluation system runs assessments on traces and stores scores alongside the trace data. There are three evaluation approaches:
LLM-as-judge runs a separate LLM call to score each trace against a rubric. Whatever produces the score, it is attached to the trace through the SDK:
    from langfuse import Langfuse

    langfuse = Langfuse()

    # Score a trace after the fact (the value could come from a judge model)
    langfuse.score(
        trace_id="trace-abc-123",
        name="answer-relevance",
        value=0.85,
        comment="Answer correctly addresses the question with appropriate detail",
    )
You can set up automatic evaluators that run on new traces as they arrive, using an LLM to score relevance, faithfulness, or any custom dimension you define. This turns manual spot-checking into an automatic pipeline.
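The shape of such an automatic pipeline can be sketched without any external service. In the example below the judge is a crude word-overlap stub standing in for a real LLM call, and the trace dictionaries are a hypothetical format; only the loop structure (skip already-scored traces, score the rest, emit score records) is the point.

```python
def stub_judge(question: str, answer: str) -> tuple[float, str]:
    """Stand-in for an LLM judge call; scores by crude word overlap.

    A real judge would send a rubric plus the trace contents to a model
    and parse a score from its reply.
    """
    q_words = set(question.lower().split())
    a_words = set(answer.lower().split())
    shared = len(q_words & a_words)
    score = round(shared / max(len(q_words), 1), 2)
    return score, f"{shared} of {len(q_words)} question words echoed"


def evaluate_new_traces(traces, judge, name="answer-relevance"):
    """Score every trace that does not yet carry a score with this name."""
    scores = []
    for trace in traces:
        if name in trace.get("scores", {}):
            continue  # already evaluated; skipping keeps the job idempotent
        value, comment = judge(trace["input"], trace["output"])
        trace.setdefault("scores", {})[name] = value
        scores.append(
            {"trace_id": trace["id"], "name": name, "value": value, "comment": comment}
        )
    return scores


traces = [
    {"id": "t1", "input": "what is span tracing", "output": "span tracing records each step"},
    {"id": "t2", "input": "reset my password", "output": "the weather is sunny", "scores": {}},
]
new_scores = evaluate_new_traces(traces, stub_judge)
for score in new_scores:
    print(score)
```

Running the function a second time on the same traces produces no new scores, which is the property you want from a job that fires on every batch of incoming traces.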
Dataset evaluations run your application against a labeled set of inputs and expected outputs, storing each run's results as scores. When you change a prompt or switch models, you run the dataset eval and compare the new scores to the baseline. This is regression testing for LLM quality.
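As a rough sketch of that regression workflow, the example below runs two application variants over a tiny labeled dataset and compares mean scores. The dataset, the two "apps", the exact-match scorer, and the 0.05 tolerance are all hypothetical choices made for illustration.

```python
def exact_match(expected: str, actual: str) -> float:
    """Simplest possible scorer: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_dataset(app, dataset, scorer=exact_match) -> float:
    """Run `app` on every item and return the mean score."""
    scores = [scorer(item["expected"], app(item["input"])) for item in dataset]
    return sum(scores) / len(scores)


# Hypothetical labeled dataset.
dataset = [
    {"input": "capital of France", "expected": "Paris"},
    {"input": "capital of Japan", "expected": "Tokyo"},
    {"input": "capital of Peru", "expected": "Lima"},
    {"input": "capital of Kenya", "expected": "Nairobi"},
]
ANSWERS = {item["input"]: item["expected"] for item in dataset}


def baseline_app(question: str) -> str:
    """Stands in for the current production configuration."""
    return ANSWERS[question]


def candidate_app(question: str) -> str:
    """Simulates a prompt change that breaks one case."""
    return "unknown" if "Kenya" in question else ANSWERS[question]


baseline = run_dataset(baseline_app, dataset)
candidate = run_dataset(candidate_app, dataset)
regressed = candidate < baseline - 0.05  # tolerance is an arbitrary choice
print(f"baseline={baseline:.2f} candidate={candidate:.2f} regressed={regressed}")
```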
Human annotation lets you build a review queue where team members label traces directly in the Langfuse UI. Human scores are stored as evaluations alongside automated scores, which lets you calibrate automated evaluators against human judgment.
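One simple way to calibrate is to measure how often a thresholded judge score agrees with the human label and pick the threshold that agrees most. The paired scores below are invented, and the function names are hypothetical; real calibration would use far more samples.

```python
# Hypothetical paired scores for the same traces: human label vs. automated judge.
pairs = [
    {"trace_id": "t1", "human": 1, "auto": 0.9},
    {"trace_id": "t2", "human": 0, "auto": 0.2},
    {"trace_id": "t3", "human": 1, "auto": 0.4},
    {"trace_id": "t4", "human": 0, "auto": 0.7},
    {"trace_id": "t5", "human": 1, "auto": 0.8},
]


def agreement(pairs, threshold: float) -> float:
    """Fraction of traces where the thresholded judge score matches the human label."""
    hits = sum(1 for p in pairs if int(p["auto"] >= threshold) == p["human"])
    return hits / len(pairs)


def best_threshold(pairs, candidates):
    """Pick the candidate threshold with the highest agreement (first one wins ties)."""
    return max(candidates, key=lambda t: agreement(pairs, t))


candidates = [0.3, 0.5, 0.75]
for t in candidates:
    print(f"threshold={t}: agreement={agreement(pairs, t):.0%}")
print("best:", best_threshold(pairs, candidates))
```

The traces where the two disagree are the ones worth reading by hand: they show either a flawed rubric or an inconsistent annotator.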
The eval tooling is functional but not the deepest implementation available. Dedicated eval platforms like Braintrust offer more sophisticated statistical analysis and testing primitives. For teams who want evaluation as part of a unified observability tool rather than a separate product, Langfuse's built-in eval is a reasonable tradeoff.
Self-hosting
The self-hosting path is one of Langfuse's genuine advantages over LangSmith. Because the license is MIT, you can run the full platform on your own infrastructure with no per-event fees and no data leaving your network.
The standard deployment is Docker Compose:
    git clone https://github.com/langfuse/langfuse.git
    cd langfuse
    docker compose up -d
This starts the Langfuse web and worker containers together with the datastores they depend on. Since Langfuse v3 that means Postgres for transactional data, ClickHouse for traces, observations, and scores, Redis for queuing and caching, and S3-compatible blob storage. For production, the recommended setup moves those datastores to managed or separately operated services (for example RDS, Cloud SQL, or Supabase for Postgres) and deploys the Langfuse containers behind a load balancer. The official Helm chart handles Kubernetes deployments.
The application containers are stateless; the operational work is in the datastores. If you can run Postgres and ClickHouse reliably, you can self-host Langfuse. The main considerations are:
- ClickHouse and blob storage for trace data, which grow with usage. Plan for data retention policies.
- The Langfuse web container, which is stateless and horizontally scalable.
- The worker container, which processes ingestion and async evaluation jobs separately from the web container.
Data residency is the most common reason teams choose self-hosting. If your traces contain personally identifiable information, sensitive business logic, or data that must stay in a specific region, self-hosting is the straightforward solution.
Langfuse vs LangSmith
This is the comparison most teams make. The honest summary:
LangSmith wins if you're building on LangChain. The integration is tighter, the UI is tuned for LangChain primitives, and the evaluation tooling has more polished workflows. LangSmith also has a free development tier, though its self-hosting requires an enterprise license.
Langfuse wins if you need self-hosting, framework agnosticism, or cost predictability. The MIT license means self-hosting is genuinely free. The OTEL ingestion means any framework works. The pricing on the cloud tier is simpler and cheaper for most workloads.
For teams using LangGraph or AutoGen, both products work well. For teams working outside the LangChain ecosystem, whether with DSPy, Griptape, or direct API calls, Langfuse's framework-agnostic approach is the cleaner path.
Integrating with specific frameworks
Beyond LangChain and LlamaIndex, Langfuse has documented integrations for:
- OpenAI SDK (Python and JS)
- Anthropic SDK
- Haystack pipelines
- AWS Bedrock via OTEL
- Vercel AI SDK
- Dify (via API)
- Any framework that emits OTEL traces
For frameworks without a dedicated integration, the low-level SDK lets you create traces and spans manually. It's more instrumentation work, but it's not significantly harder than adding structured logging to any other application.
Pricing and running costs
The cloud tiers as of May 2026:
- Hobby: Free, 50,000 observations/month, 30-day data retention, 1 project
- Pro: $59/month, 200,000 observations included + $10 per additional 100k, 90-day retention, unlimited projects
- Team: $499/month, higher limits, SSO, dedicated support
For self-hosted deployments, the only costs are compute and storage for the datastores and the Langfuse containers. For a moderate-sized production deployment, that's typically $30-100/month on a managed cloud provider, depending on trace volume and retention.
The math usually favors self-hosting for teams with data residency requirements or observation volumes above 200,000 per month. It favors cloud for teams who want zero infrastructure overhead and whose data can live in one of Langfuse Cloud's hosted regions.
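To make that comparison concrete, the sketch below computes the Pro tier bill from the figures quoted above ($59 base, 200,000 observations included, $10 per additional 100k). It assumes overage is billed per started block of 100k, which is a guess about rounding rather than a documented rule; `pro_tier_monthly_cost` is a hypothetical helper.

```python
import math


def pro_tier_monthly_cost(observations: int) -> int:
    """Monthly Pro cost in USD using the figures quoted in this article.

    Assumption: overage is billed per started block of 100k observations.
    """
    base, included, overage_per_100k = 59, 200_000, 10
    extra = max(0, observations - included)
    return base + overage_per_100k * math.ceil(extra / 100_000)


SELF_HOSTED_RANGE = (30, 100)  # USD/month infrastructure estimate quoted above

for volume in (150_000, 200_000, 650_000, 2_000_000):
    cloud = pro_tier_monthly_cost(volume)
    above_self_hosted = cloud > SELF_HOSTED_RANGE[1]
    print(f"{volume:>9,} observations -> cloud ${cloud} (above self-hosted range: {above_self_hosted})")
```

Under these assumptions the cloud bill passes the top of the self-hosted estimate somewhere past 600,000 observations a month, before counting the engineering time that self-hosting requires.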
The verdict
Langfuse has earned its position as the default answer for teams that need LLM observability without LangChain lock-in. The combination of span-level tracing, prompt versioning, and built-in evaluation in a self-hostable MIT-licensed package is genuinely compelling. The integration breadth is wide enough that "does Langfuse work with X" is almost always a yes.
The tradeoffs are real: the eval tooling is less sophisticated than dedicated eval platforms, self-hosting adds operational overhead, and the alert system requires external tooling. But for most teams, those are acceptable limitations against the core value: knowing what your agents actually did and being able to improve them systematically.
If you're running agents in production and you don't have visibility into what's happening inside them, Langfuse should be the first tool you add to your stack.
Key features
- Distributed tracing for agents, chains, and LLM calls with span-level detail
- Cost and token tracking across providers with per-model pricing tables
- Prompt management with versioning, staging, and A/B testing
- Dataset-based evaluations with LLM-as-judge and custom scorers
- Session and user tracking for multi-turn conversation analysis
- Native integrations for LangChain, LlamaIndex, OpenAI SDK, and 30+ frameworks
- OpenTelemetry-compatible ingestion for custom instrumentation
- Self-hostable on Docker or Kubernetes with MIT license