LangSmith
LLM observability, testing, and evaluation platform from the LangChain team
LangSmith is the observability and evaluation platform built by LangChain for teams developing production LLM applications. It captures traces of every LLM call and chain step, lets you build evaluation datasets from real production traces, run automated and human evaluations, and monitor quality metrics over time. It works with LangChain applications natively and with any LLM application via the SDK.
LangChain's trajectory from open-source library to production platform company is reflected in LangSmith. When LangChain launched in late 2022, it was a Python library for chaining LLM calls together. As the library grew in adoption, teams building production applications ran into the same problem: LangChain made it easy to build complex chains, but debugging why a chain produced a bad output was difficult without visibility into what happened at each step.
LangSmith, launched in September 2023, was the answer. It started as observability infrastructure for LangChain applications and expanded into a full evaluation and quality management platform.
Tracing
The foundation of LangSmith is tracing. Every LLM call, every chain step, every tool invocation in your application creates a trace record. Traces are hierarchical: a top-level trace for a request can contain nested traces for each step in a multi-step chain, each tool call, each sub-agent action.
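To make the hierarchy concrete, here is a minimal standard-library sketch of how nested runs can be recorded as a tree. The `Run`, `Tracer`, and `render` names are hypothetical and chosen for illustration; this is not LangSmith's schema or implementation, only the general shape of a hierarchical trace.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Run:
    """One node in the trace tree (hypothetical structure, not LangSmith's schema)."""
    name: str
    run_type: str  # e.g. "chain", "llm", "tool"
    children: list = field(default_factory=list)
    duration_s: float = 0.0


class Tracer:
    """Builds a tree of runs by tracking which run is currently open."""

    def __init__(self) -> None:
        self.roots = []
        self._stack = []

    @contextmanager
    def span(self, name: str, run_type: str):
        run = Run(name, run_type)
        # Attach to the currently open run, or start a new top-level trace.
        parent_list = self._stack[-1].children if self._stack else self.roots
        parent_list.append(run)
        self._stack.append(run)
        start = time.perf_counter()
        try:
            yield run
        finally:
            run.duration_s = time.perf_counter() - start
            self._stack.pop()


def render(run: Run, depth: int = 0) -> list:
    """Flatten the tree into indented lines for display."""
    lines = ["  " * depth + f"{run.run_type}: {run.name}"]
    for child in run.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = Tracer()
with tracer.span("answer_question", "chain"):
    with tracer.span("retrieve_docs", "tool"):
        pass  # a tool call would happen here
    with tracer.span("generate_answer", "llm"):
        pass  # a model call would happen here

print("\n".join(render(tracer.roots[0])))
```

The top-level span becomes the trace for the request, and each nested `with` block becomes a child run, which is the structure the trace view walks through.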
For LangChain and LangGraph applications, tracing is near-automatic. Set two environment variables, one that enables tracing and one that holds your LangSmith API key, and every run of your chain gets logged. For applications not using LangChain, the SDK lets you manually wrap LLM calls and chain steps to capture the same level of detail.
The trace view is where debugging happens. When a user reports that your application gave a bad answer, you find the trace for that request, open it, and walk through each step. You see the exact prompt sent to the model at each stage, the model's output, how that output was parsed and passed to the next step, and where something went wrong. This kind of post-hoc debugging is much harder without trace data.
Evaluation infrastructure
LangSmith's evaluation tools are its distinguishing feature compared to simpler observability platforms. The system is built around three components: datasets, evaluators, and runs.
Datasets are collections of input/output pairs that represent test cases. The insight that makes LangSmith's dataset workflow practical is that you can build datasets from production traces. When you see a real request where the response was particularly good or particularly bad, you can add it to an eval dataset with one click. This means your eval sets are grounded in real usage patterns rather than synthetic examples.
Evaluators score the outputs your application produces against the expected outputs in the dataset. LangSmith supports two types. Custom evaluators are Python functions you write that take an input, an output, and optionally a reference output, and return a score. LLM-as-judge evaluators send the response to another LLM with a scoring prompt and parse the score from that response. Both types are useful. Custom evaluators are precise but require you to write the scoring logic. LLM-as-judge handles qualitative dimensions that are hard to express as code: does this answer sound helpful? Is the tone appropriate?
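The two evaluator types can be sketched in plain Python. The function signatures, the `JUDGE_PROMPT` text, and the `fake_judge` stand-in below are assumptions made for illustration, not LangSmith's evaluator interface; a real judge would call a model instead of returning a fixed string.

```python
import re
from typing import Callable, Optional


def exact_match(inputs: dict, output: str, reference: Optional[str] = None) -> float:
    """Custom evaluator: 1.0 if output equals the reference, ignoring case and outer whitespace."""
    if reference is None:
        return 0.0
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0


# Hypothetical scoring prompt; the wording is an assumption.
JUDGE_PROMPT = (
    "Rate the helpfulness of the answer from 1 to 5.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply in the form 'Score: <number>'."
)


def llm_judge(inputs: dict, output: str, judge: Callable[[str], str]) -> Optional[float]:
    """LLM-as-judge evaluator: ask `judge` for a rating and parse it from the reply."""
    reply = judge(JUDGE_PROMPT.format(question=inputs["question"], answer=output))
    match = re.search(r"Score:\s*([1-5])", reply)
    # Return None rather than guessing when the reply cannot be parsed.
    return float(match.group(1)) if match else None


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "Score: 4"


example = {"question": "What is the capital of France?"}
print(exact_match(example, " paris ", reference="Paris"))
print(llm_judge(example, "Paris", judge=fake_judge))
```

Returning `None` for an unparseable judge reply keeps malformed responses out of the score distribution instead of silently counting them as a low score.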
Evaluation runs produce score distributions across your dataset. When you change a prompt or swap a model, you run your eval suite and compare the score distributions before and after. If the new version scores better on average and doesn't introduce new failure modes, that's evidence it's an improvement. If it regresses on specific input types, you know before pushing to production.
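A before-and-after comparison can be as simple as the sketch below. It assumes each eval run has been exported as a mapping from example ID to score; the `compare_runs` helper and the sample data are hypothetical, not part of LangSmith.

```python
from statistics import mean


def compare_runs(baseline: dict, candidate: dict, tolerance: float = 0.0) -> dict:
    """Compare per-example scores from two eval runs over the same dataset."""
    shared = sorted(baseline.keys() & candidate.keys())
    # An example regresses when the candidate scores lower than the baseline
    # by more than the allowed tolerance.
    regressions = [ex for ex in shared if candidate[ex] < baseline[ex] - tolerance]
    return {
        "baseline_mean": mean(baseline[ex] for ex in shared),
        "candidate_mean": mean(candidate[ex] for ex in shared),
        "regressions": regressions,
    }


baseline = {"ex1": 1.0, "ex2": 0.0, "ex3": 1.0, "ex4": 0.0}
candidate = {"ex1": 1.0, "ex2": 1.0, "ex3": 0.0, "ex4": 1.0}
report = compare_runs(baseline, candidate)
print(report)
```

In this sample the candidate has the higher mean but still regresses on `ex3`, which is the case the average alone would hide.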
Prompt hub and versioning
LangSmith includes a prompt management system called the Prompt Hub. You store prompts as named artifacts with version history. Applications reference prompts by name and can specify a version or pull the latest.
This solves a real problem in team settings. Without versioning, prompts are strings that live in code or, worse, in environment variables. When someone changes a prompt and things get worse, tracing which change caused the regression requires digging through git history and correlating with production behavior. With LangSmith's prompt versioning, every prompt change is recorded, and traces reference the exact prompt version that generated them.
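The idea of named prompts with version history, and traces that record which version they used, can be sketched with an in-memory store. `PromptRegistry` and its methods are hypothetical and are not the Prompt Hub API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PromptRegistry:
    """In-memory stand-in for a versioned prompt store (hypothetical)."""
    _versions: dict = field(default_factory=dict)

    def push(self, name: str, template: str) -> int:
        """Record a new version and return its 1-based version number."""
        history = self._versions.setdefault(name, [])
        history.append(template)
        return len(history)

    def pull(self, name: str, version: Optional[int] = None) -> tuple:
        """Return (version, template); the latest version when none is given."""
        history = self._versions[name]
        resolved = len(history) if version is None else version
        if not 1 <= resolved <= len(history):
            raise KeyError(f"{name} has no version {resolved}")
        return resolved, history[resolved - 1]


registry = PromptRegistry()
registry.push("summarize", "Summarize: {text}")
registry.push("summarize", "Summarize in two sentences: {text}")

version, template = registry.pull("summarize")
# Storing the resolved version with the trace is what lets a regression
# be tied back to the exact prompt change that caused it.
trace_metadata = {"prompt_name": "summarize", "prompt_version": version}
print(trace_metadata, template.format(text="..."))
```

Pinning a version (`registry.pull("summarize", 1)`) gives reproducible behavior, while pulling the latest picks up changes without a code deploy.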
The Playground integrates with the Prompt Hub. You can open a trace, see which prompt version it used, open that prompt in the Playground, tweak it, and immediately see how the change affects output on the same input. The iteration loop between seeing a bad output, understanding why, and testing a fix is much tighter than the typical cycle of editing code, deploying, and observing in production.
Human annotation
Automated evaluation covers a lot of ground, but some quality dimensions require human judgment. LangSmith's annotation queues let you route traces to human reviewers for labeling and scoring.
The annotation workflow is useful for building ground truth datasets from scratch, for cases where LLM-as-judge scoring might itself be unreliable, and for tasks where quality is genuinely hard to specify algorithmically. A human annotator reads the input and output, rates the quality on one or more dimensions, and optionally adds a correction. Those labeled examples feed back into eval datasets.
For teams building applications in domains with clear quality standards but complex evaluation criteria (legal, medical, financial), human annotation provides a baseline that automated evaluators can be calibrated against.
Monitoring dashboards
Beyond debugging individual traces, LangSmith provides aggregate monitoring. Latency percentiles, error rates, cost estimates, and volume trends are all tracked over time. You can set alerts for anomalies: a spike in error rates, a jump in p99 latency, an unexpected cost increase.
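The aggregate metrics and threshold alerts described above reduce to a small computation. The sketch below uses only the standard library; the `summarize` and `check_alerts` helpers, the thresholds, and the sample latencies are all hypothetical.

```python
from statistics import quantiles


def summarize(latencies_ms: list, errors: int) -> dict:
    """Compute latency percentiles and the error rate for one time window."""
    cuts = quantiles(latencies_ms, n=100)  # 99 cut points: cuts[i - 1] is the i-th percentile
    return {
        "p50": cuts[49],
        "p99": cuts[98],
        "error_rate": errors / len(latencies_ms),
    }


def check_alerts(stats: dict, max_p99_ms: float, max_error_rate: float) -> list:
    """Return a message for every threshold the window exceeds."""
    alerts = []
    if stats["p99"] > max_p99_ms:
        alerts.append(f"p99 latency {stats['p99']:.0f} ms exceeds {max_p99_ms:.0f} ms")
    if stats["error_rate"] > max_error_rate:
        alerts.append(f"error rate {stats['error_rate']:.1%} exceeds {max_error_rate:.1%}")
    return alerts


# 100 requests: mostly fast, with two slow outliers and three errors.
latencies = [120.0] * 98 + [4000.0, 4500.0]
stats = summarize(latencies, errors=3)
alerts = check_alerts(stats, max_p99_ms=2000.0, max_error_rate=0.02)
print(stats)
print(alerts)
```

The median stays low here while the p99 trips the alert, which is why tail percentiles are tracked separately from averages.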
LangSmith's monitoring is less specialized for per-user cost breakdowns than Helicone's, which treats per-user cost tracking as a core feature. The focus is on quality signals and aggregate performance rather than granular cost attribution.
For teams that care primarily about LLM output quality and have cost monitoring handled elsewhere (or don't need the per-user granularity), LangSmith's monitoring is sufficient. For teams building multi-tenant SaaS products where per-customer cost tracking is a billing requirement, a dedicated cost monitoring tool alongside LangSmith is a common pattern.
Integration breadth
LangSmith's SDK supports OpenAI, Anthropic, Google, Mistral, and any other provider accessible via the major LLM libraries. The LangChain integration is deepest, but direct provider integrations are well-documented.
The Python and JavaScript SDKs provide a traceable decorator and context manager for wrapping functions you want to trace. For applications built with standard provider SDKs rather than LangChain, adding LangSmith tracing means wrapping your LLM call functions with the decorator and setting the environment variables. The overhead is modest, typically a few lines per function rather than a full SDK refactor.
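A minimal sketch of the decorator approach in Python follows. `traceable` is imported from the `langsmith` package; the fallback no-op decorator, the `fake_model` stand-in, and the function names are assumptions added so the sketch runs without the package or a real provider. Whether anything is actually sent depends on the tracing environment variables being set.

```python
try:
    from langsmith import traceable
except ImportError:
    # Fallback so the sketch runs without the package installed: a no-op
    # decorator that supports both @traceable and @traceable(...).
    def traceable(*d_args, **d_kwargs):
        if len(d_args) == 1 and callable(d_args[0]) and not d_kwargs:
            return d_args[0]
        return lambda fn: fn


def fake_model(prompt: str) -> str:
    """Stand-in for a provider SDK call (hypothetical)."""
    return f"echo: {prompt}"


@traceable(run_type="llm", name="call_model")
def call_model(prompt: str) -> str:
    # In a real application this would call your provider's SDK.
    return fake_model(prompt)


@traceable(name="answer_question")
def answer_question(question: str) -> str:
    # Calling another decorated function from here nests its run under this one.
    return call_model(f"Answer briefly: {question}")


print(answer_question("What does LangSmith trace?"))
```

The decorated functions keep their normal signatures and return values, so adding tracing this way does not change how the rest of the application calls them.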
LangSmith vs Langfuse
Langfuse is an open-source alternative with strong self-hosting support and a similar feature set. The comparison worth making is about priorities.
Langfuse has a cleaner open-source story, with full self-hosting that many teams report is easier to operate than LangSmith's self-hosted version. LangSmith has stronger native LangChain integration and a more mature evaluation system, particularly for LLM-as-judge workflows.
For teams not using LangChain who want full control over their data, Langfuse's open-source posture is more complete. For teams using LangChain and building serious eval infrastructure, LangSmith's native integration and evaluation tooling are harder to replicate.
Getting started
For LangChain users, the quickest start is setting LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY in your environment. Every LangChain run will start appearing in the LangSmith dashboard immediately.
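Because tracing silently stays off when either variable is missing, a startup check can save debugging time. The sketch below only inspects the two variable names mentioned above; the `tracing_config_problems` helper is hypothetical, and the exact-match comparison against `true` is an assumption based on the value shown here.

```python
import os
from typing import Mapping


def tracing_config_problems(env: Mapping = os.environ) -> list:
    """Return a list of reasons tracing would not be configured (empty if none)."""
    problems = []
    # Assumption: the expected value is exactly "true", as in the text above.
    if env.get("LANGCHAIN_TRACING_V2", "") != "true":
        problems.append("LANGCHAIN_TRACING_V2 is not set to 'true'")
    if not env.get("LANGCHAIN_API_KEY"):
        problems.append("LANGCHAIN_API_KEY is missing or empty")
    return problems


for problem in tracing_config_problems():
    print(f"tracing not configured: {problem}")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in unit tests.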
For non-LangChain applications, install the langsmith Python package and use the traceable decorator on your LLM-calling functions. The getting started documentation covers both paths.
The eval system is worth investing in early rather than retrofitting later. Start collecting traces in development. When you have a few hundred traces, identify the interesting ones and build a small evaluation dataset. Write or configure evaluators. Run them before any significant prompt change. This discipline catches regressions before users do.
The free tier's 5,000 traces per month is limiting for active development cycles. If you're iterating quickly on a product, the $39/month Plus plan's 50,000 traces is a more comfortable ceiling for a small team.
Key features
- Full trace logging for LLM chains with nested step visibility
- Dataset management: build eval datasets from production traces
- Automated evaluation with LLM-as-judge scoring
- Human annotation queues for labeling and quality review
- Prompt hub for storing and versioning prompts
- A/B testing for comparing prompt versions or model configurations
- Playground for iterating on prompts with trace data as context
- Real-time monitoring dashboards with latency and error tracking
Pros and cons
Pros
- Deep native integration with LangChain and LangGraph for near zero-config tracing
- Evaluation tooling is the most mature in the LLM observability space
- Dataset creation from production traces makes eval datasets easy to build
- LLM-as-judge automated evaluation covers most quality dimensions
- Prompt hub with versioning solves prompt management in team settings
Cons
- SDK-based integration requires more setup than proxy-based tools like Helicone
- Free tier of 5,000 traces per month is limiting for active development
- UI can feel complex for teams that only need basic cost and latency monitoring
Who is LangSmith for?
- Teams building LangChain or LangGraph applications needing native observability
- Product teams running systematic quality evaluations of LLM outputs
- ML engineers managing prompt versions and tracking quality regression
- Organizations building annotation workflows for human eval at scale
Alternatives to LangSmith
If LangSmith isn't quite the right fit, the closest alternatives are Helicone, Langfuse, and Portkey. See our full LangSmith alternatives page for side-by-side comparisons.