Agentbrisk
developer-tools · open-source · api · Status: active

Langfuse

Open-source LLM observability with full self-hosting and production-ready tracing


Langfuse is an open-source LLM observability platform that teams use to trace, monitor, and evaluate production AI applications. The self-hosting option is a first-class feature backed by proper documentation and active maintenance, making it the strongest choice among LLM observability tools for teams with data residency requirements or a preference for owning their infrastructure.

Langfuse was started in 2023 by Marc Klingen, Clemens Rawert, and Maximilian Deichmann, three German founders who wanted an observability tool for LLM applications that they could actually own and control. The hosted market had LangSmith and a handful of others. The gap was a well-maintained, truly open-source option with first-class self-hosting support.

That focus on open source and data ownership has made Langfuse the default choice for teams in Europe navigating GDPR constraints, for organizations with security policies that prohibit sending production data to third-party SaaS platforms, and for the broader segment of developers who simply prefer to own their infrastructure.

Architecture and tracing model

Langfuse organizes observability data around three concepts: traces, observations, and scores.

A trace represents a complete unit of work, typically a single user request from start to finish. Observations are nested inside traces and represent individual steps: an LLM call, a retrieval step, a tool invocation, a custom processing step. Scores attach quality signals to traces or observations, whether from automated evaluators or human annotators.

This hierarchy mirrors how distributed tracing works in general software observability (OpenTelemetry, Jaeger, etc.), so developers who have used those tools will recognize the mental model. The hierarchy also means Langfuse handles complex multi-agent and multi-step workflows naturally: each sub-agent action is an observation nested inside the parent trace, and you can see the whole execution tree in the UI.

For simple applications that make a single LLM call per request, the hierarchy is trivially one trace with one observation. The structure doesn't add overhead in the simple case; it just doesn't constrain you when the architecture gets more complex.
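The trace, observation, and score hierarchy can be sketched with plain data classes. This is an illustrative model only, not the Langfuse SDK or its storage schema; the class names, the kind values, and the render helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Score:
    name: str
    value: float
    source: str  # e.g. "llm-judge" or "human" (illustrative labels)


@dataclass
class Observation:
    name: str
    kind: str  # e.g. "generation", "span", "event" (assumed labels)
    children: List["Observation"] = field(default_factory=list)
    scores: List[Score] = field(default_factory=list)


@dataclass
class Trace:
    name: str
    observations: List[Observation] = field(default_factory=list)
    scores: List[Score] = field(default_factory=list)


def render(trace: Trace) -> List[str]:
    """Return the execution tree as indented lines, depth-first."""
    lines = [trace.name]

    def walk(obs: Observation, depth: int) -> None:
        lines.append("  " * depth + f"{obs.name} [{obs.kind}]")
        for child in obs.children:
            walk(child, depth + 1)

    for obs in trace.observations:
        walk(obs, 1)
    return lines


# A sub-agent step with two nested observations, plus scores at both levels.
retrieval = Observation("retrieve-docs", "span")
answer = Observation("draft-answer", "generation",
                     scores=[Score("accuracy", 0.9, "llm-judge")])
agent = Observation("research-agent", "span", children=[retrieval, answer])
trace = Trace("user-request", observations=[agent],
              scores=[Score("helpfulness", 1.0, "human")])

tree = render(trace)
print("\n".join(tree))
```

A single-call application is the degenerate case of the same structure: one Trace holding one Observation with no children.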

Integration approach

Langfuse provides integrations at multiple levels, which is useful because different applications have different needs.

At the highest level, framework-specific integrations handle tracing automatically. For LangChain and LangGraph, adding Langfuse's callback handler (CallbackHandler in the Python SDK) to your chain configuration routes all trace data to Langfuse without modifying your application logic. For LlamaIndex, a similar callback integration exists. For OpenAI SDK users, Langfuse provides a drop-in wrapper that intercepts calls.

At a lower level, the core Python and JavaScript SDKs provide primitives for manual tracing. You create a trace, add observations to it as your code executes, and flush the data. This approach works for any application regardless of framework and gives you precise control over what gets logged.

The two levels are composable. You might use the automatic LangChain integration for the LLM calls and manually log custom observations for processing steps that LangChain doesn't instrument.

Prompt management

The prompt management system in Langfuse is practical and framework-agnostic. You create named prompts with version history, tag specific versions for environments (production, staging, development), and fetch prompts by name in your application.

When your application fetches a prompt from Langfuse at runtime, the trace automatically records which prompt version was used. This means every trace has a pointer to the exact prompt configuration that generated it. Debugging a quality regression involves finding the point in time when the prompt version changed and comparing trace quality before and after.

Prompts support variables: placeholders that are filled in at runtime with dynamic content. The Langfuse UI shows how variables were filled when you inspect a trace, which makes it easy to understand why the model received the prompt it did.

The prompt environment tagging is underrated. Promoting a prompt version from staging to production is a deliberate action in the Langfuse UI rather than a code change. This decouples prompt iteration from code deployment cycles and lets non-engineers who understand the domain iterate on prompts without requiring developer involvement for each change.
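A toy model of versioned prompts with environment tags shows why promotion does not require a deployment. This is a sketch, not the Langfuse client: PromptRegistry, promote, and compile_prompt are hypothetical names, and the double-brace placeholder syntax is an assumption of the example.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PromptRegistry:
    """Toy prompt store: a version list per name, and tags pointing at versions."""
    versions: Dict[str, List[str]] = field(default_factory=dict)
    labels: Dict[str, Dict[str, int]] = field(default_factory=dict)

    def create(self, name: str, template: str) -> int:
        self.versions.setdefault(name, []).append(template)
        return len(self.versions[name])  # versions are 1-based

    def promote(self, name: str, version: int, label: str) -> None:
        if not 1 <= version <= len(self.versions.get(name, [])):
            raise ValueError(f"unknown version {version} for {name!r}")
        self.labels.setdefault(name, {})[label] = version

    def get(self, name: str, label: str = "production") -> Tuple[int, str]:
        version = self.labels[name][label]
        return version, self.versions[name][version - 1]


def compile_prompt(template: str, **variables: str) -> str:
    """Fill {{variable}} placeholders; a missing variable raises KeyError."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: variables[m.group(1)], template)


registry = PromptRegistry()
v1 = registry.create("support-reply", "Answer politely: {{question}}")
v2 = registry.create("support-reply", "Answer politely and briefly: {{question}}")
registry.promote("support-reply", v1, "production")
registry.promote("support-reply", v2, "staging")

# Application code fetches by name and tag, then records the version it used.
version, template = registry.get("support-reply")
prompt = compile_prompt(template, question="How do I reset my password?")
trace_metadata = {"prompt_name": "support-reply", "prompt_version": version}
print(prompt, trace_metadata)

# Promotion is a data change; the application code above stays untouched.
registry.promote("support-reply", v2, "production")
```

Storing the version number alongside each trace is what makes the before-and-after comparison possible when a regression is being investigated.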

Evaluation system

Langfuse's evaluation tooling covers both automated and human evaluation paths.

For automated evaluation, you build a dataset of input/expected output pairs, configure evaluators (either Python functions or LLM-as-judge prompts), and run experiments. Langfuse runs your current application against each dataset item, collects scores, and records them against the dataset run. You can compare runs from different prompt versions or model configurations side by side.

LLM-as-judge evaluation is particularly useful for qualitative dimensions. You write a scoring prompt that describes what you're looking for (helpful tone, factual accuracy, appropriate length) and Langfuse submits each output to an evaluation model that scores it on those dimensions. The LLM judge is configurable; you can use OpenAI, Anthropic, or any provider you have access to.
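The dataset-and-evaluator loop can be sketched as follows. This is a minimal illustration under stated assumptions, not Langfuse's experiment runner: run_experiment is a hypothetical helper, the two application variants are lookup tables rather than real models, and stub_judge is a placeholder where a real LLM-as-judge request would go.

```python
from statistics import mean
from typing import Callable, Dict, List

# Input/expected-output pairs, as described above.
dataset: List[Dict[str, str]] = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]


def exact_match(output: str, expected: str) -> float:
    """Deterministic evaluator: 1.0 on an exact match, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def stub_judge(output: str, expected: str) -> float:
    """Placeholder for an LLM-as-judge call; swap in a real model request.

    Here it only checks that the expected answer appears in the output.
    """
    return 1.0 if expected.lower() in output.lower() else 0.0


def run_experiment(app: Callable[[str], str],
                   evaluators: Dict[str, Callable[[str, str], float]]) -> Dict[str, float]:
    """Run the app on every dataset item and average each evaluator's score."""
    scores: Dict[str, List[float]] = {name: [] for name in evaluators}
    for item in dataset:
        output = app(item["input"])
        for name, evaluate in evaluators.items():
            scores[name].append(evaluate(output, item["expected"]))
    return {name: mean(values) for name, values in scores.items()}


# Two hypothetical application variants (for example, two prompt versions).
answers_v1 = {"2 + 2": "4", "capital of France": "The capital is Paris."}
answers_v2 = {"2 + 2": "4", "capital of France": "Paris"}

evaluators = {"exact_match": exact_match, "judge": stub_judge}
run_v1 = run_experiment(answers_v1.get, evaluators)
run_v2 = run_experiment(answers_v2.get, evaluators)
print(run_v1, run_v2)
```

The two runs disagree on exact_match but agree on the judge score, which is the kind of difference a side-by-side comparison is meant to surface: a strict evaluator penalizes verbose but correct answers that a qualitative judge accepts.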

Human annotation is supported through annotation queues. You configure criteria, route traces or trace samples to the queue, and annotators review them in Langfuse's annotation UI. Human scores attach to the same trace and observation objects as automated scores, so you can compare the two.

Session tracking

Multi-turn conversations present a specific challenge for LLM observability. Each turn in a conversation is a separate request, but the quality of any single response depends on the history of the conversation. Analyzing turns in isolation misses context.

Langfuse's session feature groups related traces into sessions. A conversation session contains all the trace records for each turn, accessible together in the UI. You can review the full conversation flow, see how the model's context accumulated over turns, and understand where the conversation went wrong.

For chatbot and conversational AI applications, session tracking makes debugging multi-turn issues practical. Without it, you'd need to manually correlate individual trace records to reconstruct the conversation context.
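The manual correlation that session tracking replaces looks roughly like this. The record fields (session_id, ts, input) are illustrative assumptions, not Langfuse's export format.

```python
from collections import defaultdict

# Hypothetical trace records, arriving out of order and from mixed sessions.
traces = [
    {"id": "t3", "session_id": "s1", "ts": 3, "input": "And in winter?"},
    {"id": "t1", "session_id": "s1", "ts": 1, "input": "Best time to visit Oslo?"},
    {"id": "t2", "session_id": "s2", "ts": 2, "input": "Reset my password"},
]


def group_by_session(records):
    """Group traces by session_id and order each conversation by timestamp."""
    sessions = defaultdict(list)
    for record in records:
        sessions[record["session_id"]].append(record)
    return {sid: sorted(turns, key=lambda r: r["ts"]) for sid, turns in sessions.items()}


sessions = group_by_session(traces)
conversation = [turn["input"] for turn in sessions["s1"]]
print(conversation)
```

The follow-up question only makes sense next to the first turn, which is why reviewing traces one at a time hides the cause of many multi-turn failures.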

Self-hosting in depth

Self-hosting Langfuse is documented and actively maintained. The minimal deployment uses Docker Compose with Langfuse's application server, a PostgreSQL database, and a ClickHouse instance for analytics data. The Compose file and environment variable documentation are current and well-tested.

For production deployments, Langfuse recommends using managed database services (RDS, Cloud SQL, etc.) rather than database containers, setting up proper backup procedures for the PostgreSQL data, and configuring ClickHouse storage to scale with trace volume.

A Helm chart is available for Kubernetes deployments. Organizations with existing Kubernetes infrastructure can deploy Langfuse as another chart in their cluster rather than running a separate VM-based deployment.

The self-hosted version has no usage limits. The operational cost is your time and the compute bill, so the economics depend on volume: at something like ten million traces per month, teams with existing infrastructure engineering capacity can often run Langfuse on their own database storage for less than the cloud tier would cost.

Cloud vs self-hosted decision

The cloud version at langfuse.com is the easiest starting point. The free tier's 50,000 observations per month is enough for a small production application or active development on a larger one. Pro at $59/month scales to a million observations, which is sufficient for most single-product teams.

Self-hosting makes sense when data residency is a requirement (particularly GDPR-driven requirements for EU teams), when your trace volume is high enough that self-hosting compute costs less than cloud tier pricing, or when your organization's security policies prohibit sending production interaction data to third-party platforms.

The migration path between cloud and self-hosted is straightforward because both use the same application code. Teams often start on the cloud free tier, validate that Langfuse fits their workflow, and then migrate to self-hosting when the data privacy or cost reasons become compelling.

Getting started

The fastest path is creating an account at cloud.langfuse.com, installing the Python SDK, and adding the LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST environment variables. For LangChain users, add the callback handler. For others, use the @observe decorator from the SDK to wrap your LLM-calling functions.

Traces should start appearing in the dashboard within seconds of your first decorated function call. The observation detail view shows the full prompt, completion, token counts, cost estimate, and latency for every logged LLM call.
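A minimal sketch of the decorator path follows. It assumes the three environment variables named above are set in your shell; the import location of observe has differed between SDK versions, so both paths tried here are assumptions to check against the documentation for the version you install. The no-op fallback exists only so the sketch runs without the SDK present.

```python
import os

REQUIRED = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")
missing = [name for name in REQUIRED if name not in os.environ]
if missing:
    print("Not set, so nothing will reach the dashboard:", ", ".join(missing))

try:
    from langfuse import observe  # assumed location in newer SDK releases
except ImportError:
    try:
        from langfuse.decorators import observe  # assumed location in older releases
    except ImportError:
        def observe(*args, **kwargs):
            """No-op stand-in so the sketch runs without the SDK installed."""
            if len(args) == 1 and callable(args[0]) and not kwargs:
                return args[0]
            return lambda fn: fn


@observe()
def answer(question: str) -> str:
    # Replace with a real LLM call; a stub keeps the example self-contained.
    return f"stubbed answer to: {question}"


result = answer("What is Langfuse?")
print(result)
```

The decorated function behaves exactly as before from the caller's point of view; tracing is a side effect, which is why adding it to an existing code path carries little risk.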

The prompt management feature is worth adopting early. Moving your first production prompt into Langfuse, even before you set up evaluation, gives you version history and the ability to iterate without code changes. That discipline pays off the first time you need to trace a quality regression back to a prompt change.

Key features

  • Full trace and span logging for any LLM framework or direct API calls
  • Self-hosting via Docker Compose or Kubernetes: own your data completely
  • Prompt management with versioning, tags, and production/staging environments
  • Dataset and evaluation system: run evals on curated test sets
  • Score collection for human feedback and LLM-as-judge evaluation
  • Analytics dashboards for latency, cost, token usage, and quality scores
  • Integration with LangChain, LlamaIndex, OpenAI SDK, and direct tracing API
  • Session tracking for multi-turn conversations

Pros and cons

Pros

  • + Self-hosting is genuinely supported: Docker Compose deployment is well-documented and maintained
  • + Free cloud tier: 50,000 observations per month is generous for small teams
  • + Integrates with every major LLM framework without lock-in
  • + Open-source MIT license with active development and public roadmap
  • + Session tracking handles multi-turn conversations as single units of analysis

Cons

  • − Smaller ecosystem than LangSmith for LangChain-specific integrations
  • − Cloud tier pricing moves in fixed steps (from $59 to $99/month) rather than scaling granularly with usage
  • − Some advanced evaluation features are still maturing relative to LangSmith

Who is Langfuse for?

  • Teams with GDPR or data residency requirements needing a self-hosted solution
  • Open-source-first organizations that want auditable infrastructure
  • Production LLM applications needing trace logging without framework lock-in
  • Teams running both automated and human evaluation workflows

Alternatives to Langfuse

If Langfuse isn't quite the right fit, the closest alternatives are Helicone, LangSmith, and Portkey. See our full Langfuse alternatives page for side-by-side comparisons.

Frequently Asked Questions

What is Langfuse?
Langfuse is an open-source observability and evaluation platform for LLM applications. It captures traces of LLM calls, measures latency and costs, manages prompt versions, and supports running evaluations against curated datasets. You can use the hosted cloud version at langfuse.com or self-host the entire platform using Docker Compose. It integrates with LangChain, LlamaIndex, OpenAI's SDK, and most other LLM frameworks.

How do I self-host Langfuse?
Self-hosting Langfuse requires Docker and Docker Compose. You clone the repository, set environment variables for your database connection and authentication secrets, and run docker compose up. Langfuse uses PostgreSQL for storage and ClickHouse for analytics data. The documentation covers production hardening, including using managed database services rather than running databases in containers. For Kubernetes deployments, a Helm chart is available.

Does Langfuse work with non-LangChain applications?
Yes. Langfuse works with any LLM application. The SDK provides decorators and context managers for tracing in Python and JavaScript without depending on any particular framework. There are also dedicated integrations for LangChain, LlamaIndex, and OpenAI's SDK that reduce the manual wrapping required. If you're calling an LLM provider directly, you can use the low-level Langfuse client to log observations manually.

How does Langfuse compare to LangSmith?
Both platforms cover the same core use cases: tracing, prompt management, and evaluation. The main differences are in self-hosting and open-source depth. Langfuse's self-hosting story is more complete: the repository is fully open, the deployment documentation is thorough, and many teams run it successfully on their own infrastructure. LangSmith has deeper native LangChain integration and a more mature evaluation system for teams heavily invested in the LangChain ecosystem. LangSmith's free tier is smaller (5,000 traces vs 50,000 observations). The right choice depends on your framework preference and data control requirements.

What counts as an observation in Langfuse?
An observation is a logged unit of work in Langfuse, typically a single LLM call, a span (a segment of processing), or an event. A trace is a tree of observations representing a complete request or workflow. For a simple application that makes one LLM call per user request, each request generates one trace with one observation. For complex chains with multiple steps, each step is its own observation nested within the trace. The free tier's 50,000 observations per month therefore covers roughly 50,000 simple single-call requests, or proportionally fewer requests for multi-step chains that log several observations each.
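The free-tier arithmetic can be sketched directly. The 50,000 figure comes from the text above; the four-step chain is a hypothetical example, and requests_per_month is an illustrative helper, not part of any SDK.

```python
FREE_TIER_OBSERVATIONS = 50_000  # monthly allowance quoted above


def requests_per_month(observations_per_request: int,
                       budget: int = FREE_TIER_OBSERVATIONS) -> int:
    """How many requests fit in the budget if each emits this many observations."""
    if observations_per_request < 1:
        raise ValueError("each request logs at least one observation")
    return budget // observations_per_request


simple = requests_per_month(1)     # one LLM call per request
rag_chain = requests_per_month(4)  # hypothetical: retrieval, rerank, LLM call, post-processing
print(simple, rag_chain)
```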
