Langfuse
Open-source LLM observability with full self-hosting and production-ready tracing
Langfuse is an open-source LLM observability platform that teams use to trace, monitor, and evaluate production AI applications. The self-hosting option is a first-class feature backed by proper documentation and active maintenance, making it the strongest choice among LLM observability tools for teams with data residency requirements or a preference for owning their infrastructure.
Langfuse was started in 2023 by Marc Klingen, Clemens Rawert, and Maximilian Deichmann, a team of German founders who wanted an observability tool for LLM applications that they could actually own and control. The hosted market had LangSmith and a handful of others. The gap was a well-maintained, truly open-source option with first-class self-hosting support.
That focus on open source and data ownership has made Langfuse the default choice for teams in Europe navigating GDPR constraints, for organizations with security policies that prohibit sending production data to third-party SaaS platforms, and for the broader segment of developers who simply prefer to own their infrastructure.
Architecture and tracing model
Langfuse organizes observability data around three concepts: traces, observations, and scores.
A trace represents a complete unit of work, typically a single user request from start to finish. Observations are nested inside traces and represent individual steps: an LLM call, a retrieval step, a tool invocation, a custom processing step. Scores attach quality signals to traces or observations, whether from automated evaluators or human annotators.
This hierarchy mirrors how distributed tracing works in general software observability (OpenTelemetry, Jaeger, etc.), so developers who have used those tools will recognize the mental model. The hierarchy also means Langfuse handles complex multi-agent and multi-step workflows naturally: each sub-agent action is an observation nested inside the parent trace, and you can see the whole execution tree in the UI.
For simple applications that make a single LLM call per request, the hierarchy is trivially one trace with one observation. The structure doesn't add overhead in the simple case; it just doesn't constrain you when the architecture gets more complex.
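The nesting described above can be sketched with plain data classes. This is an illustration of the concept only, not Langfuse's actual schema: the class names, fields, and the `render` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Score:
    name: str      # e.g. "helpfulness"
    value: float
    source: str    # "automated" or "human"


@dataclass
class Observation:
    name: str
    kind: str      # e.g. "llm", "retrieval", "tool", "agent"
    children: List["Observation"] = field(default_factory=list)
    scores: List[Score] = field(default_factory=list)


@dataclass
class Trace:
    name: str
    observations: List[Observation] = field(default_factory=list)
    scores: List[Score] = field(default_factory=list)


def render(trace: Trace) -> List[str]:
    """Return the execution tree as indented lines, depth-first."""
    lines = [trace.name]

    def walk(obs: Observation, depth: int) -> None:
        lines.append("  " * depth + f"{obs.kind}: {obs.name}")
        for child in obs.children:
            walk(child, depth + 1)

    for obs in trace.observations:
        walk(obs, 1)
    return lines


# A multi-step request: a retrieval step, then a sub-agent that makes its own LLM call.
agent = Observation("research-agent", "agent", children=[Observation("summarize", "llm")])
trace = Trace("user-request", observations=[Observation("vector-search", "retrieval"), agent])
trace.scores.append(Score("helpfulness", 0.9, "automated"))

tree = render(trace)
print("\n".join(tree))
```

A single-call application is the degenerate case of the same structure: one `Trace` holding one `Observation` with no children.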
Integration approach
Langfuse provides integrations at multiple levels, which is useful because different applications have different needs.
At the highest level, framework-specific integrations handle tracing automatically. For LangChain and LangGraph, adding Langfuse's CallbackHandler to your chain configuration routes all trace data to Langfuse without modifying your application logic. For LlamaIndex, a similar callback integration exists. For OpenAI SDK users, Langfuse provides a drop-in wrapper that intercepts calls.
At a lower level, the core Python and JavaScript SDKs provide primitives for manual tracing. You create a trace, add observations to it as your code executes, and flush the data. This approach works for any application regardless of framework and gives you precise control over what gets logged.
The two levels are composable. You might use the automatic LangChain integration for the LLM calls and manually log custom observations for processing steps that LangChain doesn't instrument.
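The manual pattern (open a trace, record observations as code runs, then flush) can be sketched without any SDK. `MiniTracer` below is a hypothetical stand-in written for this example; it is not the Langfuse client, and its method names are assumptions.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional


class MiniTracer:
    """Hypothetical stand-in for an SDK client: buffers events, ships them on flush()."""

    def __init__(self) -> None:
        self._buffer: List[Dict] = []
        self._stack: List[str] = []   # names of currently open observations
        self.sent: List[Dict] = []    # what a real client would send over the network

    @contextmanager
    def observation(self, name: str, kind: str) -> Iterator[None]:
        parent: Optional[str] = self._stack[-1] if self._stack else None
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the event when the step finishes, even if it raised.
            self._stack.pop()
            self._buffer.append({
                "name": name,
                "kind": kind,
                "parent": parent,
                "duration_s": time.perf_counter() - start,
            })

    def flush(self) -> int:
        """Move buffered events to `sent` and return how many were flushed."""
        count = len(self._buffer)
        self.sent.extend(self._buffer)
        self._buffer.clear()
        return count


tracer = MiniTracer()
with tracer.observation("handle-request", "span"):
    with tracer.observation("clean-input", "span"):   # custom step, logged manually
        text = "  What is Langfuse?  ".strip()
    with tracer.observation("answer", "llm"):         # stand-in for a model call
        reply = f"echo: {text}"

flushed = tracer.flush()
print(flushed, [(e["name"], e["parent"]) for e in tracer.sent])
```

Buffering and flushing in one batch keeps logging off the request path, which is why short-lived scripts generally need an explicit flush before they exit.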
Prompt management
The prompt management system in Langfuse is practical and framework-agnostic. You create named prompts with version history, tag specific versions for environments (production, staging, development), and fetch prompts by name in your application.
When your application fetches a prompt from Langfuse at runtime, the trace automatically records which prompt version was used. This means every trace has a pointer to the exact prompt configuration that generated it. Debugging a quality regression involves finding the point in time when the prompt version changed and comparing trace quality before and after.
Prompts support variables, placeholders that get filled in at runtime with dynamic content. The Langfuse UI shows how variables were filled when you inspect a trace, which makes it easy to understand why the model received the prompt it did.
The prompt environment tagging is underrated. Promoting a prompt version from staging to production is a deliberate action in the Langfuse UI rather than a code change. This decouples prompt iteration from code deployment cycles and lets non-engineers who understand the domain iterate on prompts without requiring developer involvement for each change.
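The workflow in this section (versioned prompts, environment labels, runtime variables, and a version pointer on each trace) can be modeled in a few lines. `PromptRegistry` and `compile_prompt` are hypothetical names for this sketch, and the `{{name}}` placeholder syntax is an assumption of the example rather than a statement about Langfuse's API.

```python
import re
from typing import Dict, List, Tuple


class PromptRegistry:
    """Hypothetical in-memory model of versioned prompts with environment labels."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}       # name -> templates, index = version - 1
        self._labels: Dict[Tuple[str, str], int] = {}   # (name, label) -> version

    def create(self, name: str, template: str) -> int:
        """Store a new version and return its 1-based version number."""
        self._versions.setdefault(name, []).append(template)
        return len(self._versions[name])

    def promote(self, name: str, version: int, label: str) -> None:
        """Point an environment label at a specific version."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise ValueError(f"{name} has no version {version}")
        self._labels[(name, label)] = version

    def get(self, name: str, label: str) -> Tuple[int, str]:
        version = self._labels[(name, label)]
        return version, self._versions[name][version - 1]


def compile_prompt(template: str, variables: Dict[str, str]) -> str:
    """Fill {{name}} placeholders; a missing variable raises KeyError."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: variables[m.group(1)], template)


registry = PromptRegistry()
v1 = registry.create("support-reply", "Answer briefly: {{question}}")
v2 = registry.create("support-reply", "Answer in a {{tone}} tone: {{question}}")
registry.promote("support-reply", v1, "production")
registry.promote("support-reply", v2, "staging")

# The application fetches by name and label, then records the version on the trace.
version, template = registry.get("support-reply", "production")
trace_record = {
    "prompt_name": "support-reply",
    "prompt_version": version,
    "input": compile_prompt(template, {"question": "How do I reset my password?"}),
}

# Promotion is a data change, not a code change: the same fetch now returns version 2.
registry.promote("support-reply", v2, "production")
```

Because the application only ever asks for the `production` label, promoting a version changes behavior without a deployment, and the recorded `prompt_version` is what lets a regression be traced back to the change.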
Evaluation system
Langfuse's evaluation tooling covers both automated and human evaluation paths.
For automated evaluation, you build a dataset of input/expected output pairs, configure evaluators (either Python functions or LLM-as-judge prompts), and run experiments. Langfuse runs your current application against each dataset item, collects scores, and records them against the dataset run. You can compare runs from different prompt versions or model configurations side by side.
LLM-as-judge evaluation is particularly useful for qualitative dimensions. You write a scoring prompt that describes what you're looking for (helpful tone, factual accuracy, appropriate length) and Langfuse submits each output to an evaluation model that scores it on those dimensions. The LLM judge is configurable; you can use OpenAI, Anthropic, or any provider you have access to.
Human annotation is supported through annotation queues. You configure criteria, route traces or trace samples to the queue, and annotators review them in Langfuse's annotation UI. Human scores attach to the same trace and observation objects as automated scores, so you can compare the two.
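The automated path amounts to a loop over dataset items with one or more scoring functions. The sketch below shows that loop in isolation; the function names, the toy applications, and the `within_length` check (a stand-in for a qualitative LLM-as-judge evaluator) are all assumptions for illustration, not Langfuse's experiment API.

```python
from statistics import mean
from typing import Callable, Dict, List

DatasetItem = Dict[str, str]              # {"input": ..., "expected": ...}
Evaluator = Callable[[str, str], float]   # (output, expected) -> score in [0, 1]


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def within_length(output: str, expected: str) -> float:
    # Stand-in for a qualitative check; a real setup might call an LLM judge here.
    return 1.0 if len(output) <= 2 * len(expected) else 0.0


def run_experiment(
    app: Callable[[str], str],
    dataset: List[DatasetItem],
    evaluators: Dict[str, Evaluator],
) -> Dict[str, float]:
    """Run `app` on every item and return the mean score per evaluator."""
    scores: Dict[str, List[float]] = {name: [] for name in evaluators}
    for item in dataset:
        output = app(item["input"])
        for name, evaluate in evaluators.items():
            scores[name].append(evaluate(output, item["expected"]))
    return {name: mean(values) for name, values in scores.items()}


dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
evaluators = {"exact_match": exact_match, "within_length": within_length}

# Two toy "application versions" standing in for different prompts or models.
answers_v1 = {"2+2": "4", "capital of France": "paris"}
answers_v2 = {"2+2": "The answer is 4", "capital of France": "Paris"}

run_v1 = run_experiment(answers_v1.__getitem__, dataset, evaluators)
run_v2 = run_experiment(answers_v2.__getitem__, dataset, evaluators)
print(run_v1, run_v2)
```

Keeping per-evaluator averages for each run is what makes a side-by-side comparison possible: here the second version regresses on both dimensions.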
Session tracking
Multi-turn conversations present a specific challenge for LLM observability. Each turn in a conversation is a separate request, but the quality of any single response depends on the history of the conversation. Analyzing turns in isolation misses context.
Langfuse's session feature groups related traces into sessions. A conversation session contains the trace for every turn, and all of them are accessible together in the UI. You can review the full conversation flow, see how the model's context accumulated over turns, and understand where the conversation went wrong.
For chatbot and conversational AI applications, session tracking makes debugging multi-turn issues practical. Without it, you'd need to manually correlate individual trace records to reconstruct the conversation context.
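Conceptually, session tracking is a group-by on a shared identifier. The sketch below does the correlation by hand to show what the feature saves you from; the record layout and the `session_id` field name are assumptions of this example.

```python
from collections import defaultdict
from typing import Dict, List

Trace = Dict[str, object]

# Traces arrive interleaved across conversations and not necessarily in order.
traces: List[Trace] = [
    {"id": "t3", "session_id": "s-42", "timestamp": 3,
     "input": "It's 1234", "output": "Order 1234 ships tomorrow."},
    {"id": "t1", "session_id": "s-42", "timestamp": 1,
     "input": "My order is late", "output": "Sorry to hear that. What is the order number?"},
    {"id": "t2", "session_id": "s-7", "timestamp": 2,
     "input": "Hello", "output": "Hi! How can I help?"},
]


def group_by_session(items: List[Trace]) -> Dict[str, List[Trace]]:
    """Group traces by session, each group ordered by timestamp."""
    sessions: Dict[str, List[Trace]] = defaultdict(list)
    for trace in sorted(items, key=lambda t: t["timestamp"]):
        sessions[str(trace["session_id"])].append(trace)
    return dict(sessions)


def transcript(session: List[Trace]) -> List[str]:
    """Flatten one session into alternating user/assistant lines."""
    lines: List[str] = []
    for trace in session:
        lines.append(f"user: {trace['input']}")
        lines.append(f"assistant: {trace['output']}")
    return lines


sessions = group_by_session(traces)
print("\n".join(transcript(sessions["s-42"])))
```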
Self-hosting in depth
Self-hosting Langfuse is documented and actively maintained. The minimal deployment uses Docker Compose with Langfuse's application server, a PostgreSQL database, and a ClickHouse instance for analytics data. The Compose file and environment variable documentation are current and well-tested.
For production deployments, Langfuse recommends using managed database services (RDS, Cloud SQL, etc.) rather than database containers, setting up proper backup procedures for the PostgreSQL data, and configuring ClickHouse storage to scale with trace volume.
A Helm chart is available for Kubernetes deployments. Organizations with existing Kubernetes infrastructure can deploy Langfuse as another chart in their cluster rather than running a separate VM-based deployment.
The self-hosted version has no usage limits. At high volumes, for example ten million traces per month, self-hosting on adequately sized database storage can cost less than the cloud tier. The trade-off is operational: you pay with engineering time and a compute bill, so the saving is most realistic for teams that already have infrastructure engineering capacity.
Cloud vs self-hosted decision
The cloud version at langfuse.com is the easiest starting point. The free tier's 50,000 observations per month is enough for a small production application or active development on a larger one. Pro at $59/month scales to a million observations, which is sufficient for most single-product teams.
Self-hosting makes sense when data residency is a requirement (particularly GDPR-driven requirements for EU teams), when your trace volume is high enough that self-hosting compute costs less than cloud tier pricing, or when your organization's security policies prohibit sending production interaction data to third-party platforms.
The migration path between cloud and self-hosted is straightforward because both use the same application code. Teams often start on the cloud free tier, validate that Langfuse fits their workflow, and then migrate to self-hosting when the data privacy or cost reasons become compelling.
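The decision criteria above reduce to a short comparison. In the sketch below, only the $59 Pro price comes from this article; every other figure (compute cost, operations hours, hourly rate, the high-volume cloud bill) is a made-up placeholder to show the arithmetic, and the function names are hypothetical.

```python
def self_host_monthly_usd(compute_usd: float, ops_hours: float, hourly_rate_usd: float) -> float:
    """Self-hosting cost = compute bill + engineering time spent operating it."""
    return compute_usd + ops_hours * hourly_rate_usd


def recommend(cloud_usd: float, self_host_usd: float, data_must_stay_in_house: bool) -> str:
    """Residency or security policy decides first; otherwise compare monthly cost."""
    if data_must_stay_in_house:
        return "self-host"
    return "self-host" if self_host_usd < cloud_usd else "cloud"


# Small team on Pro ($59): $40 compute + 2 h of upkeep at $80/h = $200 self-hosted.
small_team = recommend(59.0, self_host_monthly_usd(40.0, 2.0, 80.0), False)

# High volume (placeholder $1,500 cloud bill): $300 compute + 8 h at $80/h = $940.
high_volume = recommend(1500.0, self_host_monthly_usd(300.0, 8.0, 80.0), False)

# A data residency requirement overrides the cost comparison entirely.
regulated = recommend(59.0, self_host_monthly_usd(40.0, 2.0, 80.0), True)

print(small_team, high_volume, regulated)
```

Counting engineering time explicitly is the point of the exercise: at low volume it usually dominates the compute bill.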
Getting started
The fastest path is creating an account at cloud.langfuse.com, installing the Python SDK, and adding the LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST environment variables. For LangChain users, add the callback handler. For others, use the @observe decorator from the SDK to wrap your LLM-calling functions.
Traces should start appearing in the dashboard within seconds of your first decorated function call. The observation detail view shows the full prompt, completion, token counts, cost estimate, and latency for every logged LLM call.
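A minimal sketch of the decorator route follows. It assumes a recent Python SDK that exposes `observe` at the package top level (older releases used a different import path) and that the three environment variables named above are set. The `answer` function is a placeholder for your own model call, and the fallback decorator exists only so the example runs when the SDK is not installed.

```python
# Expected in the environment before running against a real project:
#   LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST

try:
    from langfuse import observe  # import path is an assumption; it varies by SDK version
except ImportError:
    def observe(func=None, **_kwargs):
        """No-op fallback so this sketch runs without the SDK installed."""
        def wrap(f):
            return f
        return wrap(func) if callable(func) else wrap


@observe()
def answer(question: str) -> str:
    # Placeholder for a real LLM call; the decorator records inputs and outputs.
    return f"You asked: {question}"


result = answer("What is Langfuse?")
print(result)
```

The decorator leaves the function's return value untouched, so it can be added to existing code without changing callers.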
The prompt management feature is worth adopting early. Moving your first production prompt into Langfuse, even before you set up evaluation, gives you version history and the ability to iterate without code changes. That discipline pays off the first time you need to trace a quality regression back to a prompt change.
Key features
- Full trace and span logging for any LLM framework or direct API calls
- Self-hosting via Docker Compose or Kubernetes: own your data completely
- Prompt management with versioning, tags, and production/staging environments
- Dataset and evaluation system: run evals on curated test sets
- Score collection for human feedback and LLM-as-judge evaluation
- Analytics dashboards for latency, cost, token usage, and quality scores
- Integration with LangChain, LlamaIndex, OpenAI SDK, and direct tracing API
- Session tracking for multi-turn conversations
Pros and cons
Pros
- Self-hosting is genuinely supported: Docker Compose deployment is well-documented and maintained
- Free cloud tier: 50,000 observations per month is generous for small teams
- Integrates with every major LLM framework without lock-in
- Open-source MIT license with active development and public roadmap
- Session tracking handles multi-turn conversations as single units of analysis
Cons
- Smaller ecosystem than LangSmith for LangChain-specific integrations
- Cloud tier pricing jumps from $59 to $99/month with less granular scaling
- Some advanced evaluation features are still maturing relative to LangSmith
Who is Langfuse for?
- Teams with GDPR or data residency requirements needing a self-hosted solution
- Open-source-first organizations that want auditable infrastructure
- Production LLM applications needing trace logging without framework lock-in
- Teams running both automated and human evaluation workflows
Alternatives to Langfuse
If Langfuse isn't quite the right fit, the closest alternatives are Helicone, LangSmith, and Portkey.