TypeScript MIT observability evaluation

Langfuse

Open-source LLM observability, evaluation, and prompt management for agent debugging and cost tracking

Langfuse is an open-source platform for LLM observability, evaluation, and prompt management. It traces agent runs at the span level, tracks token costs across providers, manages prompt versions with staging environments, and runs evaluations using LLM-as-judge or custom scorers. It is available as a self-hosted, MIT-licensed installation or as a managed cloud service with a free hobby tier.

You ship an agent, it goes live, and users start reporting weird outputs. You open your logs. You see the final answer. You have no idea how the agent got there. What did the retrieval step return? What did the model see in the context window? Which tool call failed silently? Which prompt variant was active at the time?

This is the observability gap that Langfuse fills. It's not a framework for building agents. It's the layer that makes debugging, measuring, and improving agents possible once they're running.

What Langfuse is

Langfuse is an open-source observability, evaluation, and prompt management platform for LLM applications. It was founded in 2023 and released under the MIT license, and it has reached over 13,000 GitHub stars at langfuse/langfuse. The platform is available as a managed cloud service and as a self-hosted deployment. Both offer the same feature set; the difference is who manages the infrastructure.

The core product covers four areas that every production LLM application eventually needs:

Distributed tracing for recording exactly what happened during an agent run
Cost and token tracking for understanding what you're spending and where
Prompt management for versioning and deploying prompt templates without code deploys
Evaluation for measuring output quality at scale, not just during manual testing

These four things are related. You can't run good evaluations without good traces. You can't manage prompt versions meaningfully if you can't see which version produced which outputs. Langfuse bundles them because teams that need one usually need all of them.

Tracing: span-level visibility

The tracing model in Langfuse follows the OpenTelemetry mental model: a trace represents one complete agent run or request, composed of spans for each step. Spans are nested: a top-level trace might have spans for retrieval, reranking, and generation, and the generation span might have sub-spans for each tool call the LLM made.

Every span records:

Input and output at that step
Timing (start, end, latency)
Model used and provider
Token counts (prompt, completion, total)
Calculated cost based on Langfuse's provider pricing tables
Any metadata, tags, or custom attributes you add

The result is a complete picture of what happened in a run. When an agent produces a wrong answer, you can trace exactly which retrieval result contained the bad information, which prompt the model saw, and which tool call returned a value the model misinterpreted. This is the difference between a 5-minute debug session and a 2-hour one.
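To make the nesting concrete, here is a minimal sketch of a trace as a tree of spans. It is a plain-Python stand-in, not the Langfuse SDK: the `Span` class, its field names, and the sample timings are all assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Illustrative stand-in for one step in a trace (not a Langfuse class)."""
    name: str
    start_ms: int
    end_ms: int
    prompt_tokens: int = 0
    completion_tokens: int = 0
    children: list["Span"] = field(default_factory=list)

    @property
    def latency_ms(self) -> int:
        return self.end_ms - self.start_ms

    def total_tokens(self) -> int:
        """Tokens used by this span plus everything nested under it."""
        own = self.prompt_tokens + self.completion_tokens
        return own + sum(child.total_tokens() for child in self.children)

    def slowest_leaf(self) -> "Span":
        """Return the leaf span (no children) with the highest latency."""
        if not self.children:
            return self
        leaves = (child.slowest_leaf() for child in self.children)
        return max(leaves, key=lambda s: s.latency_ms)


# One request: retrieval, reranking, then a generation step that made a tool call.
trace = Span("rag-pipeline", 0, 1900, children=[
    Span("retrieval", 0, 300),
    Span("reranking", 300, 450),
    Span("generation", 450, 1900, prompt_tokens=1200, completion_tokens=250, children=[
        Span("tool:search", 600, 1400),
    ]),
])

print(trace.total_tokens(), trace.slowest_leaf().name)
```

Walking the tree like this is roughly what you do by eye in a trace view: roll token counts up to the root, then drill down to the step that ate the time.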

Adding tracing to your application

Langfuse supports three integration paths:

SDK-based instrumentation using the Python or TypeScript SDK:

from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

langfuse = Langfuse()

@observe()
def retrieve_documents(query: str) -> list[str]:
    # your retrieval logic
    return ["doc1 content", "doc2 content"]

@observe()
def generate_answer(context: list[str], question: str) -> str:
    langfuse_context.update_current_observation(
        input={"question": question},
        metadata={"context_length": len(context)}
    )
    # your LLM call
    return "the answer"

@observe(name="rag-pipeline")
def answer_question(question: str) -> str:
    docs = retrieve_documents(question)
    return generate_answer(docs, question)

The @observe() decorator creates a span automatically, and calling one decorated function from another produces nested spans. Call langfuse_context.update_current_observation() to attach metadata or to override the recorded input and output on the current span.

Native framework integrations where you configure a Langfuse callback handler and the framework instruments itself:

from langfuse.callback import CallbackHandler

langfuse_handler = CallbackHandler(
    public_key="your-public-key",
    secret_key="your-secret-key",
)

# LangChain
chain.invoke({"input": "your question"}, config={"callbacks": [langfuse_handler]})

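# LlamaIndex uses its own handler class rather than the LangChain one
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager
from langfuse.llama_index import LlamaIndexCallbackHandler

llama_handler = LlamaIndexCallbackHandler(
    public_key="your-public-key",
    secret_key="your-secret-key",
)
Settings.callback_manager = CallbackManager([llama_handler])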

OpenTelemetry ingestion for systems that already emit OTEL spans. Langfuse exposes an OTEL-compatible endpoint, so any framework or service that uses OTEL tracing can send data to Langfuse without the native SDK. This matters for polyglot systems where not everything runs Python or TypeScript.
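Most OTEL SDKs read their exporter settings from the standard `OTEL_EXPORTER_OTLP_*` environment variables, so pointing an existing service at Langfuse is mostly a configuration exercise. The sketch below builds those variables using only the standard library. Treat the details as assumptions to verify against the Langfuse docs for your version: the `/api/public/otel` path, the Basic-auth scheme (public key as username, secret key as password), and the placeholder host and keys.

```python
import base64


def otel_env_for_langfuse(host: str, public_key: str, secret_key: str) -> dict[str, str]:
    """Build standard OTLP exporter env vars pointing at a Langfuse instance.

    Assumption: the ingestion path and Basic-auth scheme below match your
    Langfuse version. Check the Langfuse OpenTelemetry docs before relying on them.
    """
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"{host.rstrip('/')}/api/public/otel",
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Basic {token}",
    }


# Placeholder host and keys, for illustration only.
env = otel_env_for_langfuse("https://langfuse.example.internal", "pk-placeholder", "sk-placeholder")
for name, value in env.items():
    print(f"{name}={value}")
```

Exporting these two variables in the service's environment is usually enough for an OTLP/HTTP exporter to start shipping spans, with no Langfuse-specific code in the service itself.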

Cost tracking

Every span that involves an LLM call gets a cost estimate based on token counts and Langfuse's provider pricing tables. The tables cover OpenAI, Anthropic, Cohere, Mistral, Google, and many other providers. You can also add custom model definitions with your own per-token rates, which matters if you're running inference through a custom endpoint or a provider Langfuse doesn't list by default.
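The calculation itself is simple: tokens in each direction multiplied by that direction's rate. A minimal sketch, with made-up model names and made-up prices (real rates change often and should come from the provider or from your own custom model definition):

```python
from decimal import Decimal

# Hypothetical per-1M-token prices in USD. These are NOT real provider rates.
PRICES_PER_MILLION = {
    "example-large": {"input": Decimal("2.50"), "output": Decimal("10.00")},
    # A custom entry, e.g. for a self-run inference endpoint with a flat rate.
    "my-custom-endpoint": {"input": Decimal("0.40"), "output": Decimal("0.40")},
}


def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Cost of one LLM call: tokens times the per-token rate, per direction."""
    price = PRICES_PER_MILLION[model]
    total = prompt_tokens * price["input"] + completion_tokens * price["output"]
    return total / Decimal(1_000_000)


cost = call_cost("example-large", prompt_tokens=1200, completion_tokens=250)
print(cost)
```

`Decimal` is used here instead of floats because per-call costs are tiny and get summed millions of times; that is a design choice for the sketch, not a statement about how Langfuse stores them.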

The cost data is aggregated across projects, models, users, and time periods in the dashboard. You can answer questions like:

What percentage of our total cost comes from GPT-4o versus Claude Haiku?
Which users are driving the most expensive sessions?
Did our cost per query increase after we switched to a different retrieval strategy?
What's our average cost per successful task completion?
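Under the hood, each of these questions is a group-by over per-trace cost records. A rough sketch of two of them, using hypothetical record fields (`model`, `user`, `cost`, `success`) rather than any real export format:

```python
from collections import defaultdict

# Hypothetical flattened trace records; field names are illustrative.
records = [
    {"model": "model-a", "user": "u1", "cost": 0.040, "success": True},
    {"model": "model-a", "user": "u2", "cost": 0.060, "success": False},
    {"model": "model-b", "user": "u1", "cost": 0.010, "success": True},
    {"model": "model-b", "user": "u1", "cost": 0.015, "success": True},
]


def cost_share_by(key: str, rows: list[dict]) -> dict[str, float]:
    """Fraction of total cost attributable to each value of `key`."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["cost"]
    grand_total = sum(totals.values())
    return {name: subtotal / grand_total for name, subtotal in totals.items()}


def cost_per_success(rows: list[dict]) -> float:
    """Total spend divided by successful runs (assumes at least one success)."""
    successes = sum(1 for row in rows if row["success"])
    return sum(row["cost"] for row in rows) / successes


print(cost_share_by("model", records))
print(cost_share_by("user", records))
print(cost_per_success(records))
```

Note that cost per success counts the spend on failed runs too, which is the point: a cheap model that fails often can cost more per completed task than an expensive one that does not.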

For teams building on top of paid LLM APIs, this visibility is genuinely useful for budget planning and for making the case that switching models is worth the quality tradeoff.

Prompt management

Prompt management is the feature that gets underestimated until you've felt the pain of the alternative. Without a prompt registry, your prompts live in code. Updating a prompt requires a code change, a PR, a review, and a deploy. If you want to test different prompt variants, you're managing that in code too. If something goes wrong in production, you can't roll back just the prompt without rolling back the code.

Langfuse's prompt management decouples prompts from code:

from langfuse import Langfuse

langfuse = Langfuse()

# Fetch the production-labeled version of a prompt
prompt = langfuse.get_prompt("rag-system-prompt")

# Use the template with variable substitution
system_message = prompt.compile(context="relevant docs", user_name="Alice")

The prompt template lives in Langfuse. You label a version as "production" in the UI. Your application fetches that label at runtime. To update the prompt, you edit the template in Langfuse and relabel the new version as production. No code change. No deploy. Rollback is relabeling the previous version.
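The reason rollback is cheap is that versions are immutable and a label is just a pointer to one of them. The toy registry below illustrates that idea in memory. `PromptRegistry` and `compile_template` are hypothetical names for this sketch, not Langfuse classes; the only thing borrowed from Langfuse is the double-brace placeholder style.

```python
import re


class PromptRegistry:
    """Toy in-memory registry: versions are immutable, labels are movable pointers."""

    def __init__(self) -> None:
        self._versions: dict[str, list[str]] = {}
        self._labels: dict[tuple[str, str], int] = {}

    def create_version(self, name: str, template: str) -> int:
        self._versions.setdefault(name, []).append(template)
        return len(self._versions[name])  # versions are numbered from 1

    def set_label(self, name: str, label: str, version: int) -> None:
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise ValueError(f"unknown version {version} for prompt {name!r}")
        self._labels[(name, label)] = version

    def get(self, name: str, label: str = "production") -> str:
        return self._versions[name][self._labels[(name, label)] - 1]


def compile_template(template: str, **variables: str) -> str:
    """Fill {{variable}} placeholders; unknown placeholders are left untouched."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: variables.get(m.group(1), m.group(0)),
        template,
    )


registry = PromptRegistry()
v1 = registry.create_version("rag-system-prompt", "Answer for {{user_name}} using: {{context}}")
v2 = registry.create_version("rag-system-prompt", "Be brief, {{user_name}}. Sources: {{context}}")

registry.set_label("rag-system-prompt", "production", v2)  # promote v2
registry.set_label("rag-system-prompt", "production", v1)  # roll back: just move the label

message = compile_template(
    registry.get("rag-system-prompt"), context="relevant docs", user_name="Alice"
)
print(message)
```

Nothing is deleted or rewritten during the rollback. The application keeps asking for "production" and simply starts receiving the older text.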

Langfuse also links prompt versions to the traces that used them, so you can query "what outputs did version 12 of this prompt produce" and compare quality across versions. This is the data that A/B testing prompt variants actually needs.
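Once each trace carries the prompt version that produced it, comparing versions is a group-and-average. A small sketch over a hypothetical export (the row shape and the scores are invented for illustration):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical export: one row per trace, tagged with the prompt version used.
rows = [
    {"prompt_version": 11, "score": 0.70},
    {"prompt_version": 11, "score": 0.80},
    {"prompt_version": 12, "score": 0.85},
    {"prompt_version": 12, "score": 0.95},
    {"prompt_version": 12, "score": 0.90},
]


def mean_score_by_version(traces: list[dict]) -> dict[int, float]:
    """Group trace scores by prompt version and average each group."""
    grouped: dict[int, list[float]] = defaultdict(list)
    for trace in traces:
        grouped[trace["prompt_version"]].append(trace["score"])
    return {version: mean(scores) for version, scores in sorted(grouped.items())}


summary = mean_score_by_version(rows)
print(summary)
```

With samples this small the gap between versions could easily be noise, so in practice you would want many more traces per version before declaring a winner.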

Evaluation

Langfuse's evaluation system runs assessments on traces and stores scores alongside the trace data. There are three evaluation approaches:

LLM-as-judge runs a separate LLM call to score each trace against a rubric. Whatever produces the value, the score is attached to the trace through the SDK:

from langfuse import Langfuse

langfuse = Langfuse()

# Score a trace after the fact
langfuse.score(
    trace_id="trace-abc-123",
    name="answer-relevance",
    value=0.85,
    comment="Answer correctly addresses the question with appropriate detail"
)

You can set up automatic evaluators that run on new traces as they arrive, using an LLM to score relevance, faithfulness, or any custom dimension you define. This turns manual spot-checking into an automatic pipeline.
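If you want to see the shape of such a pipeline outside the UI, it is a loop: for each trace, ask a judge for a number, then record it. The sketch below is an assumption-heavy illustration. `evaluate_traces`, `fake_judge`, the rubric text, and the trace dict shape are all hypothetical, and the judge is a deterministic stub so the code runs without an API key.

```python
from typing import Callable

RUBRIC = "Rate from 0 to 1 how directly the answer addresses the question."


def evaluate_traces(
    traces: list[dict],
    judge: Callable[[str, str, str], float],
) -> list[dict]:
    """Run the judge on every trace and return score records.

    `judge(rubric, question, answer)` is a placeholder for your own LLM call.
    """
    scores = []
    for trace in traces:
        value = judge(RUBRIC, trace["input"], trace["output"])
        # Clamp so a misbehaving judge cannot write out-of-range scores.
        value = max(0.0, min(1.0, value))
        scores.append({"trace_id": trace["id"], "name": "answer-relevance", "value": value})
    return scores


def fake_judge(rubric: str, question: str, answer: str) -> float:
    """Deterministic stand-in: any non-empty answer passes."""
    return 1.0 if answer.strip() else 0.0


results = evaluate_traces(
    [
        {"id": "trace-1", "input": "What is OTEL?", "output": "OpenTelemetry."},
        {"id": "trace-2", "input": "What is a span?", "output": ""},
    ],
    judge=fake_judge,
)
print(results)
```

Each record uses the same `trace_id`, `name`, and `value` fields as the scoring call shown earlier, so writing the results back is one call per record.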

Dataset evaluations run your application against a labeled set of inputs and expected outputs, storing each run's results as scores. When you change a prompt or switch models, you run the dataset eval and compare the new scores to the baseline. This is regression testing for LLM quality.
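The comparison step can be as simple as a per-item diff against the baseline run. A minimal sketch, assuming each run is a mapping from dataset item id to score; the 0.05 tolerance is an arbitrary illustration, not a recommended value:

```python
from statistics import mean


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> list[str]:
    """Return dataset item ids whose score dropped by more than `tolerance`."""
    return sorted(
        item_id
        for item_id, old_score in baseline.items()
        if item_id in candidate and old_score - candidate[item_id] > tolerance
    )


# Hypothetical scores from two runs over the same three dataset items.
baseline_run = {"q1": 0.90, "q2": 0.80, "q3": 0.75}
candidate_run = {"q1": 0.92, "q2": 0.60, "q3": 0.72}

regressed = find_regressions(baseline_run, candidate_run)
mean_delta = mean(candidate_run.values()) - mean(baseline_run.values())
print(regressed, round(mean_delta, 3))
```

Looking at per-item regressions as well as the mean matters because an average can hold steady while a handful of important cases get much worse.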

Human annotation lets you build a review queue where team members label traces directly in the Langfuse UI. Human scores are stored as evaluations alongside automated scores, which lets you calibrate automated evaluators against human judgment.
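Calibration here means checking how often the automated score and the human score tell the same story on the same traces. One simple way to sketch it, with invented score pairs and an assumed pass/fail threshold of 0.5:

```python
def calibration_report(
    pairs: list[tuple[float, float]],
    threshold: float = 0.5,
) -> dict[str, float]:
    """Compare (human, automated) score pairs collected on the same traces."""
    mean_abs_error = sum(abs(h - a) for h, a in pairs) / len(pairs)
    # A pair "agrees" when both scores land on the same side of the threshold.
    agreements = sum((h >= threshold) == (a >= threshold) for h, a in pairs)
    return {"mean_abs_error": mean_abs_error, "agreement_rate": agreements / len(pairs)}


# Hypothetical (human, automated) scores for four traces.
pairs = [(1.0, 0.9), (0.0, 0.2), (1.0, 0.4), (0.0, 0.1)]
report = calibration_report(pairs)
print(report)
```

A low agreement rate is a signal to revise the judge's rubric before trusting it on traces no human will ever read.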

The eval tooling is functional but not the deepest implementation available. Dedicated eval platforms like Braintrust offer more sophisticated statistical analysis and testing primitives. For teams who want evaluation as part of a unified observability tool rather than a separate product, Langfuse's built-in eval is a reasonable tradeoff.

Self-hosting

The self-hosting path is one of Langfuse's genuine advantages over LangSmith. Because the license is MIT, you can run the full platform on your own infrastructure with no per-event fees and no data leaving your network.

The standard deployment is Docker Compose:

git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up -d

This runs the Langfuse server and a Postgres database locally. For production, the recommended setup adds a managed Postgres instance (RDS, Cloud SQL, or Supabase) and deploys the Langfuse server behind a load balancer. The official Helm chart handles Kubernetes deployments.

The operational surface is minimal: Langfuse is a stateless web app backed by Postgres. If you can manage a Postgres database, you can self-host Langfuse. The main considerations are:

Postgres storage for traces, which grows with usage. Plan for data retention policies.
The Langfuse server container, which is stateless and horizontally scalable.
Background workers for async evaluation jobs, which run as a separate container.
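For the storage item, a back-of-envelope estimate is usually enough to pick a retention window. The sketch below assumes a fixed average size per observation, which you would have to measure on your own data, and it ignores indexes and database overhead:

```python
def steady_state_storage_gb(
    observations_per_day: int,
    avg_bytes_per_observation: int,
    retention_days: int,
) -> float:
    """Approximate stored volume once the retention window is full.

    Ignores indexes, replication, and other database overhead.
    """
    total_bytes = observations_per_day * avg_bytes_per_observation * retention_days
    return total_bytes / 1e9


# Hypothetical workload: 20k observations/day at ~4 KB each, kept for 90 days.
estimate = steady_state_storage_gb(20_000, 4_000, 90)
print(f"{estimate:.1f} GB")
```

Because the estimate scales linearly with retention, halving the window halves the steady-state volume, which makes retention the easiest lever to pull when storage grows.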

Data residency is the most common reason teams choose self-hosting. If your traces contain personally identifiable information, sensitive business logic, or data that must stay in a specific region, self-hosting is the straightforward solution.

Langfuse vs LangSmith

This is the comparison most teams make. The honest summary:

LangSmith wins if you're building on LangChain. The integration is tighter, the UI is tuned for LangChain primitives, and the evaluation tooling has more polished workflows. LangSmith also has a free development tier, though its self-hosting requires an enterprise license.

Langfuse wins if you need self-hosting, framework agnosticism, or cost predictability. The MIT license means self-hosting is genuinely free. The OTEL ingestion means any framework works. The pricing on the cloud tier is simpler and cheaper for most workloads.

For teams using LangGraph or AutoGen, both products work well. For teams working outside the LangChain ecosystem, whether with DSPy, Griptape, or direct API calls, Langfuse's framework-agnostic approach is the cleaner path.

Integrating with specific frameworks

Beyond LangChain and LlamaIndex, Langfuse has documented integrations for:

OpenAI SDK (Python and JS)
Anthropic SDK
Haystack pipelines
AWS Bedrock via OTEL
Vercel AI SDK
Dify (via API)
Any framework that emits OTEL traces

For frameworks without a dedicated integration, the low-level SDK lets you create traces and spans manually. It's more instrumentation work, but it's not significantly harder than adding structured logging to any other application.

Pricing and running costs

The cloud tiers as of May 2026:

Hobby: Free, 50,000 observations/month, 30-day data retention, 1 project
Pro: $59/month, 200,000 observations included + $10 per additional 100k, 90-day retention, unlimited projects
Team: $499/month, higher limits, SSO, dedicated support

For self-hosted deployments, the only costs are compute and storage for your Postgres database and Langfuse containers. For a moderate-sized production deployment, that's typically $30-100/month on a managed cloud provider, depending on trace volume and retention.

The math usually favors self-hosting for teams with data residency requirements or with volumes well above the 200,000 observations per month included in Pro. It favors cloud for teams who want zero infrastructure overhead and whose data can live in a US-hosted SaaS.
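You can run that comparison yourself with the Pro figures quoted above. The sketch assumes overage is billed per started block of 100k observations; if billing is prorated, the numbers come out slightly lower. The self-hosted figure is whatever you estimate for your own compute and storage.

```python
import math

# Pro plan figures as quoted in this article.
PRO_BASE_USD = 59
PRO_INCLUDED = 200_000
PRO_OVERAGE_USD_PER_100K = 10


def pro_monthly_cost(observations: int) -> int:
    """Pro plan cost, assuming overage is billed per started block of 100k."""
    extra = max(0, observations - PRO_INCLUDED)
    return PRO_BASE_USD + math.ceil(extra / 100_000) * PRO_OVERAGE_USD_PER_100K


def cloud_is_cheaper(observations: int, self_hosted_usd: float) -> bool:
    """Compare Pro against your own estimate of self-hosted running costs."""
    return pro_monthly_cost(observations) < self_hosted_usd


for volume in (200_000, 250_000, 1_000_000):
    print(volume, pro_monthly_cost(volume))
```

This only compares invoices. It leaves out the engineering time that self-hosting takes, which for a small team can outweigh the difference in either direction.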

The verdict

Langfuse has earned its position as the default answer for teams that need LLM observability without LangChain lock-in. The combination of span-level tracing, prompt versioning, and built-in evaluation in a self-hostable MIT-licensed package is genuinely compelling. The integration breadth is wide enough that "does Langfuse work with X" is almost always a yes.

The tradeoffs are real: the eval tooling is less sophisticated than dedicated eval platforms, self-hosting adds operational overhead, and the alert system requires external tooling. But for most teams, those are acceptable limitations against the core value: knowing what your agents actually did and being able to improve them systematically.

If you're running agents in production and you don't have visibility into what's happening inside them, Langfuse should be the first tool you add to your stack.

Key features

Distributed tracing for agents, chains, and LLM calls with span-level detail
Cost and token tracking across providers with per-model pricing tables
Prompt management with versioning, staging, and A/B testing
Dataset-based evaluations with LLM-as-judge and custom scorers
Session and user tracking for multi-turn conversation analysis
Native integrations for LangChain, LlamaIndex, OpenAI SDK, and 30+ frameworks
OpenTelemetry-compatible ingestion for custom instrumentation
Self-hostable on Docker or Kubernetes with MIT license

Frequently Asked Questions

What is Langfuse?

Langfuse is an open-source observability and evaluation platform for LLM applications. It captures traces of your agent runs, chains, and LLM calls with span-level timing and input/output detail. Beyond tracing, it manages prompt versions, tracks token costs, and runs automated evaluations on production data. The platform is available as a cloud service with a free tier or as a self-hosted installation under the MIT license.

How does Langfuse compare to LangSmith?

LangSmith is LangChain's observability product. It has tighter integration with LangChain primitives and is generally more polished for teams already using LangChain. Langfuse is framework-agnostic: it works with any LLM framework or direct API calls through its SDK and OpenTelemetry ingestion. Langfuse is also self-hostable under MIT, which matters for teams with data residency requirements that cannot use a US-hosted SaaS. LangSmith's free tier is generous but its self-hosting option requires an enterprise license. For non-LangChain teams, Langfuse is usually the better fit.

Can I self-host Langfuse?

Yes. Langfuse is MIT-licensed and designed for self-hosting. The standard deployment is a Docker Compose setup with a Postgres database. For production, a Kubernetes deployment with the official Helm chart is the recommended path. Self-hosting gives you full control over data retention, residency, and cost, since there are no per-event fees on a self-hosted installation. The main operational cost is managing the Postgres database and the Langfuse containers.

What is prompt management in Langfuse?

Langfuse's prompt management feature lets you store, version, and deploy prompt templates from a central registry. You create prompts in the Langfuse UI or API, label versions as production or staging, and fetch the active version in your application at runtime. This decouples prompt iteration from code deployments, so you can update a prompt without pushing a new release. Langfuse also links prompt versions to the traces that used them, so you can see which version produced which outputs.

What SDKs does Langfuse support?

Langfuse ships a Python SDK and a TypeScript/JavaScript SDK. It also has native integrations with LangChain (Python and JS), LlamaIndex, OpenAI SDK, Anthropic SDK, and several other frameworks that add zero-config tracing via callbacks or monkey-patching. For frameworks without a native integration, Langfuse accepts OpenTelemetry-formatted traces, which means any system that can emit OTEL spans can send data to Langfuse without a dedicated SDK.

Is Langfuse's free tier actually usable?

Yes, for development and modest production workloads. The hobby cloud plan includes 50,000 observations per month, which is enough for active development and small applications. The main limits are on data retention (shorter than paid plans) and the number of projects per account. For heavier production workloads, the Pro plan at $59/month removes most practical constraints.