Tags: developer-tools, api, productivity. Status: active

LangSmith

LLM observability, testing, and evaluation platform from the LangChain team

LangSmith is the observability and evaluation platform built by LangChain for teams developing production LLM applications. It captures traces of every LLM call and chain step, lets you build evaluation datasets from real production traces, run automated and human evaluations, and monitor quality metrics over time. It works with LangChain applications natively and with any LLM application via the SDK.

LangChain's trajectory from open-source library to production platform company is reflected in LangSmith. When LangChain launched in late 2022, it was a Python library for chaining LLM calls together. As the library grew in adoption, teams building production applications ran into the same problem: LangChain made it easy to build complex chains, but debugging why a chain produced a bad output was difficult without visibility into what happened at each step.

LangSmith, launched in September 2023, was the answer. It started as observability infrastructure for LangChain applications and expanded into a full evaluation and quality management platform.

Tracing

The foundation of LangSmith is tracing. Every LLM call, every chain step, every tool invocation in your application creates a trace record. Traces are hierarchical: a top-level trace for a request can contain nested traces for each step in a multi-step chain, each tool call, each sub-agent action.
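The hierarchy described above can be pictured as a tree of runs. The following is a minimal standard-library sketch of that idea, not LangSmith's actual data model: the Run and Tracer classes, the span method, and the run-type labels are all hypothetical names chosen for illustration.

```python
from __future__ import annotations

import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Run:
    """One node in a trace tree: an LLM call, a chain step, or a tool call."""
    name: str
    run_type: str  # "chain", "llm", "tool": labels assumed for this sketch
    children: list[Run] = field(default_factory=list)
    duration_s: float = 0.0


class Tracer:
    """Keeps a stack of open runs so nested steps attach to their parent."""

    def __init__(self) -> None:
        self.roots: list[Run] = []
        self._stack: list[Run] = []

    @contextmanager
    def span(self, name: str, run_type: str):
        run = Run(name, run_type)
        # Attach to the currently open run, or start a new top-level trace.
        parent_list = self._stack[-1].children if self._stack else self.roots
        parent_list.append(run)
        self._stack.append(run)
        start = time.perf_counter()
        try:
            yield run
        finally:
            run.duration_s = time.perf_counter() - start
            self._stack.pop()


def render(run: Run, depth: int = 0) -> list[str]:
    """Flatten a trace tree into indented lines, parent before children."""
    lines = [f"{'  ' * depth}{run.run_type}:{run.name}"]
    for child in run.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = Tracer()
with tracer.span("answer_question", "chain"):
    with tracer.span("retrieve_docs", "tool"):
        pass  # a tool call would happen here
    with tracer.span("generate_answer", "llm"):
        pass  # a model call would happen here

trace_lines = render(tracer.roots[0])
print("\n".join(trace_lines))
```

The top-level request becomes the root, and each step opened while it is still running is recorded as a child, which is the shape a trace view then lets you walk through.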

For LangChain and LangGraph applications, tracing is near-automatic. Set two environment variables, LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY, and every run of your chain gets logged. For applications not using LangChain, the SDK lets you manually wrap LLM calls and chain steps to capture the same level of detail.

The trace view is where debugging happens. When a user reports that your application gave a bad answer, you find the trace for that request, open it, and walk through each step. You see the exact prompt sent to the model at each stage, the model's output, how that output was parsed and passed to the next step, and where something went wrong. This kind of post-hoc debugging is much harder without trace data.

Evaluation infrastructure

LangSmith's evaluation tools are its distinguishing feature compared to simpler observability platforms. The system is built around three components: datasets, evaluators, and runs.

Datasets are collections of input/output pairs that represent test cases. The insight that makes LangSmith's dataset workflow practical is that you can build datasets from production traces. When you see a real request where the response was particularly good or particularly bad, you can add it to an eval dataset with one click. This means your eval sets are grounded in real usage patterns rather than synthetic examples.
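As a rough illustration of what happens when a trace is promoted into a dataset, the sketch below converts a logged trace into a test case. The trace dictionary shape, the Example class, and example_from_trace are assumptions made for this example and do not mirror the LangSmith SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Example:
    """One test case: the inputs plus the output to compare against."""
    inputs: dict
    reference_output: str
    source_trace_id: str


def example_from_trace(trace: dict, corrected_output: Optional[str] = None) -> Example:
    """Turn a logged trace into a test case.

    A good response is kept as its own reference. A bad response needs a
    corrected output supplied by whoever reviewed it.
    """
    reference = corrected_output if corrected_output is not None else trace["output"]
    return Example(
        inputs=dict(trace["inputs"]),
        reference_output=reference,
        source_trace_id=trace["id"],
    )


# Hypothetical traces: one good answer, one bad answer with a correction.
good_trace = {"id": "trace-001", "inputs": {"question": "What is 2 + 2?"}, "output": "4"}
bad_trace = {"id": "trace-002", "inputs": {"question": "Capital of France?"}, "output": "Lyon"}

dataset = [
    example_from_trace(good_trace),
    example_from_trace(bad_trace, corrected_output="Paris"),
]
```

Keeping the source trace ID on each example preserves the link back to the real request that motivated the test case.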

Evaluators score the outputs your application produces against the expected outputs in the dataset. LangSmith supports two types. Custom evaluators are Python functions you write that take an input, an output, and optionally a reference output, and return a score. LLM-as-judge evaluators send the response to another LLM with a scoring prompt and parse the score from that response. Both types are useful. Custom evaluators are precise but require you to write the scoring logic. LLM-as-judge handles qualitative dimensions: does this answer sound helpful? Is the tone appropriate? These are hard to express as code.
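The two evaluator styles can be sketched in plain Python. This is an illustration only: the function names, the returned dictionary shape, and the "Score: N" reply convention are assumptions, and the exact signature LangSmith expects depends on the SDK version you use.

```python
import re
from typing import Optional


def exact_match(inputs: dict, output: str, reference: Optional[str] = None) -> dict:
    """Custom evaluator: 1.0 when output equals the reference, ignoring
    case and surrounding whitespace. No reference means no score."""
    if reference is None:
        return {"key": "exact_match", "score": None}
    matched = output.strip().lower() == reference.strip().lower()
    return {"key": "exact_match", "score": 1.0 if matched else 0.0}


# Hypothetical scoring prompt that would be sent to a judge model.
JUDGE_PROMPT = (
    "Rate how helpful the answer is on a scale of 1 to 5.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Finish with a line of the form 'Score: N'."
)


def parse_judge_score(judge_reply: str, low: int = 1, high: int = 5) -> Optional[float]:
    """Pull 'Score: N' out of a judge model's reply and normalize it to 0..1.

    Returns None when no score is found or it falls outside the scale,
    so a malformed reply is never silently counted as a real score.
    """
    match = re.search(r"score\s*:\s*(\d+)", judge_reply, flags=re.IGNORECASE)
    if match is None:
        return None
    value = int(match.group(1))
    if not low <= value <= high:
        return None
    return (value - low) / (high - low)


code_result = exact_match({"question": "Capital of France?"}, " Paris ", "paris")
judge_result = parse_judge_score("The answer is clear and mostly complete.\nScore: 4")
```

The custom evaluator is deterministic and cheap, while the judge path depends on the model following the reply format, which is why the parser treats anything unexpected as missing.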

Evaluation runs produce score distributions across your dataset. When you change a prompt or swap a model, you run your eval suite and compare the score distributions before and after. If the new version scores better on average and doesn't introduce new failure modes, that's evidence it's an improvement. If it regresses on specific input types, you know before pushing to production.
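A before-and-after comparison of this kind might look like the sketch below. It assumes each result is a (category, score) pair from the same dataset and uses a hypothetical tolerance of 0.05; none of these names come from LangSmith itself.

```python
from collections import defaultdict
from statistics import mean


def mean_by_category(results):
    """Average score per input category for a list of (category, score) pairs."""
    grouped = defaultdict(list)
    for category, score in results:
        grouped[category].append(score)
    return {category: mean(scores) for category, scores in grouped.items()}


def compare_runs(baseline, candidate, tolerance=0.05):
    """Report the overall change in mean score and any category whose
    mean dropped by more than `tolerance` between the two runs."""
    base_means = mean_by_category(baseline)
    cand_means = mean_by_category(candidate)
    regressed = sorted(
        category
        for category in base_means
        if category in cand_means
        and cand_means[category] < base_means[category] - tolerance
    )
    overall_delta = mean([s for _, s in candidate]) - mean([s for _, s in baseline])
    return {"overall_delta": overall_delta, "regressed_categories": regressed}


# Made-up scores: the new prompt helps billing questions but hurts refunds.
baseline = [("billing", 0.8), ("billing", 0.9), ("refunds", 0.9), ("refunds", 0.7)]
candidate = [("billing", 1.0), ("billing", 1.0), ("refunds", 0.7), ("refunds", 0.7)]

report = compare_runs(baseline, candidate)
```

Here the average goes up while one input type gets worse, which is the case an overall mean alone would hide.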

Prompt hub and versioning

LangSmith includes a prompt management system called the Prompt Hub. You store prompts as named artifacts with version history. Applications reference prompts by name and can specify a version or pull the latest.

This solves a real problem in team settings. Without versioning, prompts are strings that live in code or, worse, in environment variables. When someone changes a prompt and things get worse, tracing which change caused the regression requires digging through git history and correlating with production behavior. With LangSmith's prompt versioning, every prompt change is recorded, and traces reference the exact prompt version that generated them.
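The idea of named prompts with version history, and traces that record which version produced them, can be sketched with a small in-memory store. PromptStore, push, and pull are hypothetical names for this illustration and are not the Prompt Hub's API.

```python
from __future__ import annotations

from typing import Optional


class PromptStore:
    """In-memory stand-in for a versioned prompt registry."""

    def __init__(self) -> None:
        self._versions: dict[str, list[str]] = {}

    def push(self, name: str, template: str) -> int:
        """Record a new version of a prompt and return its 1-based version number."""
        history = self._versions.setdefault(name, [])
        history.append(template)
        return len(history)

    def pull(self, name: str, version: Optional[int] = None) -> tuple[int, str]:
        """Fetch a specific version, or the latest when no version is given."""
        history = self._versions.get(name)
        if not history:
            raise KeyError(f"unknown prompt: {name}")
        resolved = len(history) if version is None else version
        if not 1 <= resolved <= len(history):
            raise KeyError(f"{name} has no version {resolved}")
        return resolved, history[resolved - 1]


store = PromptStore()
store.push("support-reply", "Answer the customer: {question}")
store.push("support-reply", "Answer the customer politely and briefly: {question}")

# The application pulls the latest version and stamps it on the trace record.
version, template = store.pull("support-reply")
trace_record = {
    "prompt_name": "support-reply",
    "prompt_version": version,
    "rendered_prompt": template.format(question="Where is my order?"),
}
```

Because the version number travels with the trace, a regression can be tied to the prompt change that introduced it without reconstructing history from git.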

The Playground integrates with the Prompt Hub. You can open a trace, see which prompt version it used, open that prompt in the Playground, tweak it, and immediately see how the change affects output on the same input. The iteration loop between seeing a bad output, understanding why, and testing a fix is much tighter than the typical cycle of editing code, deploying, and observing in production.

Human annotation

Automated evaluation covers a lot of ground, but some quality dimensions require human judgment. LangSmith's annotation queues let you route traces to human reviewers for labeling and scoring.

The annotation workflow is useful for building ground truth datasets from scratch, for cases where LLM-as-judge scoring might itself be unreliable, and for tasks where quality is genuinely hard to specify algorithmically. A human annotator reads the input and output, rates the quality on one or more dimensions, and optionally adds a correction. Those labeled examples feed back into eval datasets.

For teams building applications in domains with clear quality standards but complex evaluation criteria (legal, medical, financial), human annotation provides a baseline that automated evaluators can be calibrated against.
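One simple way to calibrate an automated evaluator against a human baseline is to measure how often the two agree on the same traces. The sketch below assumes labels are stored as dictionaries keyed by trace ID; the label values and function names are illustrative.

```python
def agreement_rate(human_labels: dict, judge_labels: dict) -> float:
    """Fraction of traces labeled by both sources where the labels match."""
    shared = human_labels.keys() & judge_labels.keys()
    if not shared:
        raise ValueError("no traces were labeled by both sources")
    agreed = sum(1 for trace_id in shared if human_labels[trace_id] == judge_labels[trace_id])
    return agreed / len(shared)


def disagreements(human_labels: dict, judge_labels: dict) -> list:
    """Trace IDs worth a second look: both sources labeled them, differently."""
    shared = human_labels.keys() & judge_labels.keys()
    return sorted(t for t in shared if human_labels[t] != judge_labels[t])


# Made-up labels: humans reviewed four traces, the judge scored five.
human = {"t1": "pass", "t2": "fail", "t3": "pass", "t4": "fail"}
judge = {"t1": "pass", "t2": "pass", "t3": "pass", "t4": "fail", "t5": "pass"}

rate = agreement_rate(human, judge)
to_review = disagreements(human, judge)
```

A low agreement rate suggests the judge prompt needs revision before its scores are trusted, and the disagreement list is a natural starting point for that revision.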

Monitoring dashboards

Beyond debugging individual traces, LangSmith provides aggregate monitoring. Latency percentiles, error rates, cost estimates, and volume trends are all tracked over time. You can set alerts for anomalies: a spike in error rates, a jump in p99 latency, an unexpected cost increase.
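To make the aggregate metrics concrete, here is a small sketch that computes latency percentiles and an error rate from a batch of runs and flags values above a threshold. The run record shape, metric names, and threshold values are assumptions for this example, not LangSmith's alerting configuration.

```python
from statistics import quantiles


def summarize(runs: list) -> dict:
    """Compute p50 and p99 latency and the error rate for a batch of runs.

    Each run is a dict with 'latency_ms' and 'error' keys (assumed shape).
    """
    latencies = [run["latency_ms"] for run in runs]
    # n=100 yields 99 cut points; index k-1 holds the k-th percentile.
    cuts = quantiles(latencies, n=100, method="inclusive")
    errors = sum(1 for run in runs if run["error"])
    return {
        "p50_ms": cuts[49],
        "p99_ms": cuts[98],
        "error_rate": errors / len(runs),
    }


def triggered_alerts(summary: dict, thresholds: dict) -> list:
    """Names of metrics whose current value exceeds its threshold."""
    return sorted(name for name, limit in thresholds.items() if summary[name] > limit)


# Made-up data: mostly fast runs plus two slow, failing outliers.
runs = [{"latency_ms": 100, "error": False} for _ in range(98)]
runs += [{"latency_ms": 2000, "error": True}, {"latency_ms": 4000, "error": True}]

summary = summarize(runs)
alerts = triggered_alerts(summary, {"p99_ms": 1500, "error_rate": 0.05})
```

The median stays low while the p99 jumps, which is why tail percentiles rather than averages are the usual alerting signal for latency.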

LangSmith's monitoring is less specialized for per-user cost breakdowns than Helicone's, which treats per-user cost tracking as a core feature. The focus is on quality signals and aggregate performance rather than granular cost attribution.

For teams that care primarily about LLM output quality and have cost monitoring handled elsewhere (or don't need the per-user granularity), LangSmith's monitoring is sufficient. For teams building multi-tenant SaaS products where per-customer cost tracking is a billing requirement, a dedicated cost monitoring tool alongside LangSmith is a common pattern.

Integration breadth

LangSmith's SDK supports OpenAI, Anthropic, Google, Mistral, and any other provider accessible via the major LLM libraries. The LangChain integration is deepest, but direct provider integrations are well-documented.

The Python and JavaScript SDKs provide a traceable decorator and context manager for wrapping functions you want to trace. For applications built with standard provider SDKs rather than LangChain, adding LangSmith tracing means wrapping your LLM call functions with the decorator and setting the environment variables. The overhead is modest, typically a few lines per function rather than a full SDK refactor.
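A minimal sketch of the decorator approach follows. It imports traceable from the langsmith package when that is installed and otherwise falls back to a no-op decorator so the example still runs. The call_model function is a hypothetical stand-in for a provider SDK call, and traces are only sent when tracing is enabled through the environment variables.

```python
try:
    from langsmith import traceable  # decorator provided by the langsmith package
except ImportError:
    # Fallback for this sketch only: a no-op decorator that accepts both the
    # bare form (@traceable) and the parameterized form (@traceable(...)).
    def traceable(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda fn: fn


def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a provider SDK call; returns a canned reply."""
    return f"echo: {prompt}"


@traceable(run_type="llm", name="call_model")
def traced_call_model(prompt: str) -> str:
    # Wrapping the model call records its input and output as one run.
    return call_model(prompt)


@traceable(name="answer_question")
def answer_question(question: str) -> str:
    # A traced function calling another traced function yields a nested run.
    prompt = f"Answer briefly: {question}"
    return traced_call_model(prompt)


result = answer_question("What is a trace?")
```

The application logic is unchanged apart from the decorators, which is what keeps the integration cost to a few lines per function.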

LangSmith vs Langfuse

Langfuse is an open-source alternative with strong self-hosting support and a similar feature set. The comparison worth making is about priorities.

Langfuse has a cleaner open-source story, with full self-hosting that many teams report is easier to operate than LangSmith's self-hosted version. LangSmith has stronger native LangChain integration and a more mature evaluation system, particularly for LLM-as-judge workflows.

For teams not using LangChain who want full control over their data, Langfuse's open-source posture is more complete. For teams using LangChain and building serious eval infrastructure, LangSmith's native integration and evaluation tooling are harder to replicate.

Getting started

For LangChain users, the quickest start is setting LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY in your environment. Every LangChain run will start appearing in the LangSmith dashboard immediately.
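If you prefer to set these variables from Python rather than the shell, a small helper like the one below works, provided it runs before your LangChain code does. The helper name and the placeholder key are illustrative, and newer SDK releases may document different variable names, so check the current documentation.

```python
import os
from typing import MutableMapping


def enable_langsmith_tracing(
    api_key: str, env: MutableMapping[str, str] = os.environ
) -> None:
    """Set the two variables named above on the given environment mapping.

    `env` defaults to the real process environment; passing a plain dict
    makes the helper easy to try out without touching os.environ.
    """
    if not api_key:
        raise ValueError("an API key is required")
    env["LANGCHAIN_TRACING_V2"] = "true"
    env["LANGCHAIN_API_KEY"] = api_key


# Demonstration on a throwaway mapping with a placeholder key.
demo_env: dict = {}
enable_langsmith_tracing("placeholder-api-key", env=demo_env)
```

In a real project the key would come from a secrets manager or the deployment environment rather than being written into source code.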

For non-LangChain applications, install the langsmith Python package and use the traceable decorator on your LLM-calling functions. The getting started documentation covers both paths.

The eval system is worth investing in early rather than retrofitting later. Start collecting traces in development. When you have a few hundred traces, identify the interesting ones and build a small evaluation dataset. Write or configure evaluators. Run them before any significant prompt change. This discipline catches regressions before users do.

The free tier's 5,000 traces per month is limiting for active development cycles. If you're iterating quickly on a product, the $39/month Plus plan's 50,000 traces is a more comfortable ceiling for a small team.

Key features

Full trace logging for LLM chains with nested step visibility
Dataset management: build eval datasets from production traces
Automated evaluation with LLM-as-judge scoring
Human annotation queues for labeling and quality review
Prompt hub for storing and versioning prompts
A/B testing for comparing prompt versions or model configurations
Playground for iterating on prompts with trace data as context
Real-time monitoring dashboards with latency and error tracking

Pros and cons

Pros

+ Deep native integration with LangChain and LangGraph for near-zero-config tracing
+ Evaluation tooling is the most mature in the LLM observability space
+ Dataset creation from production traces makes eval datasets easy to build
+ LLM-as-judge automated evaluation covers most quality dimensions
+ Prompt hub with versioning solves prompt management in team settings

Cons

− SDK-based integration requires more setup than proxy-based tools like Helicone
− Free tier of 5,000 traces per month is limiting for active development
− UI can feel complex for teams that only need basic cost and latency monitoring

Who is LangSmith for?

Teams building LangChain or LangGraph applications needing native observability
Product teams running systematic quality evaluations of LLM outputs
ML engineers managing prompt versions and tracking quality regression
Organizations building annotation workflows for human eval at scale

Alternatives to LangSmith

If LangSmith isn't quite the right fit, the closest alternatives are Helicone, Langfuse, and Portkey. See our full LangSmith alternatives page for side-by-side comparisons.

Frequently Asked Questions

What is LangSmith?

LangSmith is an LLM observability and evaluation platform built by LangChain. It captures traces of LLM application runs (every model call, every chain step, every tool use) and stores them for debugging, evaluation, and monitoring. Beyond logging, LangSmith has tools for building evaluation datasets, running automated evaluations, collecting human feedback, and tracking quality metrics over time. It integrates natively with LangChain but also works with any LLM application via a Python or JavaScript SDK.

Do I need to use LangChain to use LangSmith?

No. LangSmith works with any LLM application. For LangChain and LangGraph applications, integration is nearly automatic: the frameworks emit traces to LangSmith when environment variables are set. For applications not using LangChain, LangSmith provides a Python and JavaScript SDK for manually wrapping LLM calls to capture traces. The LangChain integration is smoother, but the platform is not limited to LangChain.

What is LangSmith's evaluation feature?

LangSmith's evaluation system lets you run systematic quality checks on your LLM application. You create a dataset of input/output pairs representing cases you want to test. You write evaluators that score responses, either custom Python functions or LLM-as-judge prompts that assess quality on dimensions like relevance, accuracy, or tone. LangSmith runs your application against the dataset, collects the scores, and tracks them over time so you can detect quality regressions when you change prompts or models.

How does LangSmith pricing work?

LangSmith pricing is based on trace volume. The free Developer plan includes 5,000 traces per month, which is enough for small projects and early development. Plus at $39/month includes 50,000 traces and additional features. Team plans are priced per user and include collaboration features. Enterprise pricing is custom with SSO, longer data retention, and SLAs. Additional traces beyond plan limits are billed at $0.005 per trace on Plus.
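The Plus plan arithmetic can be worked through with the figures quoted above. This sketch assumes overage is billed linearly per trace beyond the included volume and ignores per-user Team pricing; prices change, so treat the constants as a snapshot of the numbers in this article.

```python
PLUS_BASE_USD = 39.00               # monthly Plus price quoted above
PLUS_INCLUDED_TRACES = 50_000       # traces included in the Plus plan
PLUS_OVERAGE_USD_PER_TRACE = 0.005  # price per additional trace on Plus


def plus_monthly_cost(traces: int) -> float:
    """Estimated monthly Plus bill in USD for a given trace volume."""
    if traces < 0:
        raise ValueError("trace count cannot be negative")
    extra_traces = max(0, traces - PLUS_INCLUDED_TRACES)
    return round(PLUS_BASE_USD + extra_traces * PLUS_OVERAGE_USD_PER_TRACE, 2)


within_plan = plus_monthly_cost(40_000)  # under the included volume
over_plan = plus_monthly_cost(80_000)    # 30,000 traces over the limit
```

At 80,000 traces the estimate is 39 + 30,000 × 0.005 = 189 USD, so overage dominates the bill well before trace volume doubles.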

What is the LangSmith Playground?

The Playground is an interactive prompt editor inside LangSmith. You can open any logged trace, modify the prompt, and re-run it to see how changes affect the output, all without touching your codebase. This is useful for rapid prompt iteration using real production inputs. Instead of testing prompts against synthetic examples you made up, you're testing against the actual inputs your application is receiving from users.