Python · commercial · observability · evaluation · prompt-management

LangSmith

LangChain's observability and evaluation platform for debugging, testing, and monitoring LLM applications

LangSmith is LangChain's commercial observability, evaluation, and prompt management platform for LLM applications. It traces every step of agent and chain execution, supports dataset-based evaluation and LLM-as-judge scoring, and integrates natively with the full LangChain ecosystem including LangGraph. Available as a cloud SaaS with a free developer tier; self-hosting requires an enterprise license.

If you're running a LangChain or LangGraph application in production and something goes wrong, LangSmith is the tool you reach for. LangChain built it specifically to solve the observability problem that every team hits: you have a chain, it returns a bad answer, and you have no practical way to see what happened inside it. LangSmith makes the inside visible.

What LangSmith is

LangSmith is a commercial SaaS platform for LLM observability, evaluation, and prompt management. LangChain, Inc. launched it in 2023 alongside the rising popularity of LangChain itself. The pitch was simple: if you're already using LangChain to build, you should use LangSmith to observe. For a LangChain application, getting started requires two environment variables and zero code changes.

That origin story shapes what LangSmith is and isn't. It's excellent for LangChain and LangGraph teams. It's workable but more manual for teams on other stacks. And it's a closed-source SaaS, which means your trace data lives on LangChain's infrastructure unless you have an enterprise self-hosting agreement.

The platform covers four areas:

Tracing for recording the complete execution of agent runs and chains
Evaluation for measuring output quality with datasets and automated scoring
Prompt management through the LangChain Hub
Monitoring for ongoing production health metrics

Getting started in two minutes

If you're on LangChain, setup is genuinely this simple:

pip install langsmith langchain-openai

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"

# That's it. Every LangChain run now appears in LangSmith.
from langchain_openai import ChatOpenAI
# HumanMessage lives in langchain_core, which langchain-openai installs as a dependency
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o-mini")
response = llm.invoke([HumanMessage(content="What is the capital of France?")])

Every LLM call, chain step, and tool invocation that happens through LangChain creates a trace in LangSmith automatically. You don't decorate functions or manually create spans. The framework does it for you.
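Because the whole setup hinges on those two environment variables, a small preflight check can save you from wondering why no traces showed up. The sketch below is a hypothetical helper, not part of the langsmith package; the variable names come from the setup snippet above, and the function names are made up for illustration.

```python
import os

# Variable names taken from the setup snippet above.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY")


def missing_tracing_vars(env=None):
    """Return the required tracing variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def tracing_enabled(env=None):
    """True only when both variables are set and tracing is switched to 'true'."""
    env = os.environ if env is None else env
    if missing_tracing_vars(env):
        return False
    return env["LANGCHAIN_TRACING_V2"].lower() == "true"
```

Calling tracing_enabled() at application startup and logging a warning when it returns False is usually enough.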

For LangGraph, the tracing is equally automatic. A complex multi-node agent graph with conditional branching, tool calls, and human-in-the-loop steps produces a full execution trace where you can inspect every node's input and output.

If you're not on LangChain, you instrument manually:

from langsmith import traceable

@traceable(name="my-retrieval-step")
def retrieve_documents(query: str) -> list[str]:
    # your retrieval logic
    return ["doc1", "doc2"]

@traceable(name="rag-pipeline")
def answer_question(question: str) -> str:
    docs = retrieve_documents(question)
    # your generation logic
    return "answer"

The @traceable decorator creates LangSmith spans. Nesting decorated functions creates nested spans in the trace. It works, but it's the same manual instrumentation you'd write for any observability tool.

Tracing and debugging

A LangSmith trace shows you every step in a run as a tree of spans. For a RAG pipeline, that's typically:

The top-level chain run with the user's question as input
A retrieval span showing the query sent to the vector store and the documents returned
A prompt formatting span showing the exact context window the LLM saw
An LLM call span with the model, token counts, latency, and full output

When something goes wrong, you open the trace and find the span where the inputs look wrong. Maybe the retrieval step returned irrelevant documents. Maybe the prompt template dropped a variable. Maybe a tool returned an error that the agent silently ignored. The trace makes these problems visible in minutes rather than hours.
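The same "find the span where things look wrong" search can be sketched in code. The example below assumes a trace has been exported as a nested dictionary; the field names (name, output, error, children) are hypothetical and chosen for this illustration, not LangSmith's export schema.

```python
def find_suspect_spans(span, path=()):
    """Yield the path of every span that errored or produced an empty output."""
    here = path + (span["name"],)
    if span.get("error") or span.get("output") in (None, [], ""):
        yield " > ".join(here)
    for child in span.get("children", []):
        yield from find_suspect_spans(child, here)


# A RAG run where the retriever came back empty and the model had nothing to work with.
trace = {
    "name": "rag-pipeline",
    "output": "I don't know.",
    "children": [
        {"name": "retriever", "output": [], "children": []},
        {"name": "prompt", "output": "Context:\nQuestion: ...", "children": []},
        {"name": "llm", "output": "I don't know.", "children": []},
    ],
}

suspects = list(find_suspect_spans(trace))
```

Here suspects points at the retriever span, which is the same conclusion you'd reach by eye in the trace view: the final answer was bad because retrieval returned nothing.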

LangSmith also captures feedback you add to traces. If a user thumbs-down a response, you can attach that signal to the trace and later filter your traces by feedback to build evaluation datasets from your actual failure cases.
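As a rough sketch of that feedback-to-dataset step, the function below filters exported traces by a feedback score and reshapes the failures into candidate dataset examples. The record layout and the user_score key are assumptions made for illustration; in practice you'd map them to whatever your export actually contains.

```python
def failures_to_examples(traces, feedback_key="user_score", threshold=0):
    """Turn thumbs-down traces into candidate examples that still need a corrected answer."""
    examples = []
    for trace in traces:
        score = trace.get("feedback", {}).get(feedback_key)
        # Traces without any feedback are skipped rather than treated as failures.
        if score is not None and score <= threshold:
            examples.append({
                "inputs": trace["inputs"],
                "bad_output": trace["outputs"],  # kept for reference during review
                "source_trace_id": trace["id"],
            })
    return examples
```

The bad output is kept alongside the input on purpose: a reviewer still has to write the correct reference answer before the example is useful for evaluation.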

Evaluation

LangSmith's evaluation workflow is centered on datasets. You create a dataset of input/output pairs, run your application against the dataset, and score the outputs using evaluators.

from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator

client = Client()

# Create a dataset
dataset = client.create_dataset("rag-eval-v1")
client.create_examples(
    inputs=[
        {"question": "What is photosynthesis?"},
        {"question": "Who invented the telephone?"},
    ],
    outputs=[
        {"answer": "Photosynthesis converts sunlight into energy in plants."},
        {"answer": "Alexander Graham Bell invented the telephone."},
    ],
    dataset_id=dataset.id,
)

# Define your application as a function
def my_rag_pipeline(inputs: dict) -> dict:
    # your RAG logic here
    return {"answer": "generated answer"}

# Run evaluation
results = evaluate(
    my_rag_pipeline,
    data="rag-eval-v1",
    evaluators=[LangChainStringEvaluator("cot_qa")],
    experiment_prefix="gpt-4o-mini-baseline",
)

The results show up in LangSmith's experiment comparison UI, where you can view scores per example, compare across experiments (different model versions, prompt variants, retrieval strategies), and drill into individual failures.
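If you export per-example scores from two experiments, the "drill into individual failures" step comes down to a diff. The sketch below assumes each experiment is a plain mapping from example ID to score; that shape and the function name are hypothetical, not something the SDK returns directly.

```python
def find_regressions(baseline, candidate, min_drop=0.1):
    """Return (example_id, old_score, new_score) for scores that fell, worst drop first."""
    drops = []
    for example_id, old_score in baseline.items():
        new_score = candidate.get(example_id)
        # Examples missing from the candidate run are ignored, not counted as failures.
        if new_score is not None and old_score - new_score >= min_drop:
            drops.append((example_id, old_score, new_score))
    return sorted(drops, key=lambda d: d[2] - d[1])
```

Looking at per-example drops rather than only the average matters because a new prompt can raise the mean score while still breaking a handful of cases that used to work.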

LangSmith also offers online evaluation: you configure an evaluator to run automatically on a sample of production traces as they arrive. This catches quality regressions in production without waiting for a manual eval run.
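How LangSmith picks its sample internally isn't described here, but the general technique is easy to sketch. One common approach is to hash the trace ID into the range [0, 1) and evaluate only the traces that fall below the sampling rate, which keeps the decision deterministic per trace. The function name and the choice of SHA-256 are assumptions for this example.

```python
import hashlib


def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly sample_rate of traces by hashing the ID."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate
```

Hash-based sampling means the same trace always gets the same decision, so re-processing a batch doesn't evaluate a different subset each time.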

The built-in evaluators cover criteria like correctness, relevance, conciseness, and hallucination detection. All of them use an LLM judge under the hood, which introduces both cost and variability. For high-stakes evaluation where you need precise measurement, dedicated platforms like Braintrust offer more statistical rigor.
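One way to offset judge cost and variability is to pair LLM judges with a cheap deterministic metric that returns the same score every time. The sketch below computes token-overlap F1 against the reference answer. The f1_evaluator wrapper and its key/score return shape are an assumed convention for this example; check the SDK documentation for the exact signature custom evaluators need before wiring it in.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings match perfectly; one empty side scores zero.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def f1_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    """Hypothetical evaluator wrapper: returns a named score for one example."""
    score = token_f1(outputs["answer"], reference_outputs["answer"])
    return {"key": "token_f1", "score": score}
```

A lexical metric like this won't catch paraphrases the way a judge model can, so it works best as a stable baseline alongside the LLM-based scores rather than a replacement for them.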

Prompt management via LangChain Hub

The LangChain Hub is LangSmith's answer to the "prompts in source code" problem. You store prompt templates in the Hub and pull the active version at runtime:

from langchain import hub

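# Pull the latest published version. To pin a specific version instead,
# append a commit hash: "your-username/your-prompt-name:<commit-hash>".
prompt = hub.pull("your-username/your-prompt-name")

# Use it in a chain (llm is the ChatOpenAI instance from the setup example)
chain = prompt | llm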

Updating the prompt means editing the template in LangSmith's UI and publishing a new version. No code change, no deployment. Rollback is one click. LangSmith links each trace to the prompt version that produced it, so you can compare output quality across versions without guessing which prompt was active during a bad run.
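The publish, pull, and rollback cycle is easier to reason about with a toy model. The class below is an in-memory stand-in written for this article, not the Hub's API; every name in it is hypothetical. The point it illustrates is that pull returns the version number together with the template, so each run can be logged against the prompt version that produced it.

```python
class ToyPromptRegistry:
    """In-memory stand-in for a prompt registry; not the LangChain Hub API."""

    def __init__(self):
        self._versions = []  # template strings; list index + 1 is the version number
        self._active = None  # version number currently served to callers

    def publish(self, template: str) -> int:
        """Store a new version and make it the active one."""
        self._versions.append(template)
        self._active = len(self._versions)
        return self._active

    def rollback(self, version: int) -> None:
        """Serve an earlier version again without deleting newer ones."""
        if not 1 <= version <= len(self._versions):
            raise ValueError(f"unknown version {version}")
        self._active = version

    def pull(self):
        """Return (version, template) so callers can record the version per run."""
        if self._active is None:
            raise LookupError("no prompt published yet")
        return self._active, self._versions[self._active - 1]
```

Because rollback only moves the active pointer, the newer version stays available for comparison after you've stopped serving it.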

For teams iterating rapidly on prompts, this workflow genuinely speeds things up. The alternative, managing prompt templates in version-controlled files with manual deployment scripts, is slower and more error-prone.

Production monitoring

LangSmith's monitoring dashboards track the metrics that matter for production LLM applications:

Request volume and error rates over time
Latency distributions (p50, p90, p99)
Token usage and estimated cost per run
LLM call success rates by model
Feedback aggregations from your users

The dashboards are configurable and let you break down metrics by tags you add to traces. If you tag traces with the user's plan tier, you can compare latency and cost for free versus paid users. If you tag by the agent type, you can see which agent flows are most expensive.
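If you ever need the same breakdown outside the dashboard, the computation is small. The sketch below groups exported runs by a tag prefix and computes nearest-rank percentiles per group. The run record shape (tags, latency_s) and the "plan:" tag prefix are assumptions made for this example.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_by_tag(runs, tag_prefix="plan:"):
    """Group run latencies by the first tag starting with tag_prefix, then summarize."""
    groups = defaultdict(list)
    for run in runs:
        tag = next((t for t in run["tags"] if t.startswith(tag_prefix)), "untagged")
        groups[tag].append(run["latency_s"])
    return {
        tag: {p: percentile(vals, p) for p in (50, 90, 99)}
        for tag, vals in groups.items()
    }
```

With traces tagged "plan:free" and "plan:paid", the result is one p50/p90/p99 summary per plan tier, which is the free-versus-paid comparison described above.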

There's no built-in alerting. If you want to be paged when error rates spike, you need to export metrics to your existing monitoring stack via the API. This is a real gap that LangSmith hasn't addressed as of early 2026.
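If you do export run data to your own stack, the alert rule itself can be very small. The sketch below checks the error rate over a window of runs and returns a message when it crosses a threshold. How the runs are fetched is left out, and the error field name is an assumption; routing the message to a pager or chat channel is up to your existing tooling.

```python
def error_rate_alert(runs, threshold=0.05, min_runs=20):
    """Return an alert message when the window's error rate exceeds threshold, else None."""
    total = len(runs)
    if total < min_runs:
        return None  # too little traffic in this window to judge
    errors = sum(1 for run in runs if run.get("error"))
    rate = errors / total
    if rate > threshold:
        return f"error rate {rate:.1%} over {total} runs exceeds {threshold:.1%}"
    return None
```

The min_runs guard keeps a single failed request during a quiet period from paging anyone.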

Pricing in practice

The free Developer plan works for solo developers and small production workloads. Its 5,000 traces per month can run out quickly, depending on how your runs are logged. A single RAG query with retrieval, reranking, and generation might produce 5-10 spans. If those steps are recorded as separate top-level runs rather than nested under a single trace, 5,000 traces can represent as few as 500-1,000 user queries.
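The arithmetic behind that estimate is simple enough to keep in a helper while you're sizing a plan. In the sketch below, traces_per_query is an assumed input: the number of billable traces one user query ends up producing in your setup, which you'd measure from your own usage rather than take from this article.

```python
def user_queries_covered(monthly_trace_limit: int, traces_per_query: int) -> int:
    """How many user queries fit in a monthly trace allowance."""
    if traces_per_query < 1:
        raise ValueError("traces_per_query must be at least 1")
    return monthly_trace_limit // traces_per_query


free_tier_worst_case = user_queries_covered(5_000, 10)  # 500
free_tier_best_case = user_queries_covered(5_000, 5)    # 1000
```

Running the same function with the limit of whichever plan you're considering shows quickly whether an upgrade actually covers your traffic.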

The Plus plan at $39/month raises the limit to 50,000 traces. For most small teams in production, that's enough. The Teams plan at $299/month is where pricing starts to feel significant.

Where LangSmith gets expensive is at real scale. If you're running tens of millions of LLM calls per month, the per-trace model can reach into the thousands of dollars. At that point, the economics favor either negotiating an enterprise deal or looking at self-hostable alternatives.

Self-hosting requires an enterprise agreement. There's no community self-hosted option the way Langfuse offers. If data residency is a hard requirement and enterprise pricing is out of budget, this is a dealbreaker.

LangSmith vs the alternatives

LangSmith vs Langfuse

The honest comparison: LangSmith wins on developer experience for LangChain teams, Langfuse wins on flexibility and total cost of ownership for everyone else. If your stack is LangChain or LangGraph, use LangSmith. The zero-config tracing alone saves significant instrumentation time. If you're using anything else, or if self-hosting matters, Langfuse is the better call.

LangSmith vs Helicone

Helicone is a proxy-based observability tool: you route your LLM API calls through Helicone's endpoint and get logs, costs, and latency without changing application code. That's genuinely simpler for basic monitoring. But Helicone doesn't give you the agent-level tracing that LangSmith does. You see individual LLM calls, not the orchestration layer that produced them. For teams that just want cost and latency monitoring, Helicone is cheaper and simpler. For teams that need to debug agent behavior, LangSmith's depth wins.

LangSmith vs Braintrust

Braintrust is focused on evaluation and has more sophisticated eval tooling than LangSmith, including better statistical analysis and a testing workflow that feels more like software CI than ad-hoc experimentation. LangSmith has better production monitoring. Teams that prioritize systematic evaluation tend to prefer Braintrust. Teams that prioritize production debugging and monitoring tend to prefer LangSmith.

Who should use LangSmith

LangSmith makes the most sense for specific situations:

Teams running LangChain or LangGraph in production. The zero-config automatic tracing is a genuine competitive advantage. If you're already paying for LangChain's ecosystem, LangSmith's integration depth is hard to match.

Teams who want evaluation and monitoring in one place. LangSmith's unified approach (traces feed into datasets, datasets feed into eval experiments, and eval results sit alongside production monitoring) reduces the number of tools you have to manage.

Startups and small teams who want fast setup. The free tier is usable, setup takes minutes, and the UI is polished enough that you can share traces with non-technical stakeholders without confusion.

LangSmith is harder to recommend for teams using other frameworks, teams with strict data residency requirements that can't afford enterprise self-hosting, or high-volume production workloads where per-trace pricing compounds quickly.

The verdict

LangSmith is the best observability tool for LangChain teams, and it's not particularly close. The automatic tracing, the polished UI, the prompt Hub, and the dataset-based evaluation workflow are all well-executed. For teams in the LangChain ecosystem, the question isn't whether to use LangSmith but whether the free tier is enough or whether you need Plus.

For teams outside the LangChain ecosystem, the calculation is different. The instrumentation is more manual, the pricing is less predictable at scale, and the self-hosting restriction matters if you have data residency requirements. In those cases, Langfuse or Arize Phoenix is usually the better fit.

Key features

Automatic tracing for LangChain, LangGraph, and LlamaIndex with zero instrumentation
Run-level debugging showing full input/output for every chain and LLM call
Dataset creation from production traces for regression testing
LLM-as-judge and custom evaluators with scoring dashboards
Prompt versioning and deployment with the LangChain Hub
Online evaluation on production traces with configurable sampling
Monitoring dashboards for latency, cost, error rate, and feedback
Annotation queues for human review and labeling workflows

Frequently Asked Questions

What is LangSmith?

LangSmith is LangChain's commercial observability and evaluation platform for LLM applications. It captures a trace of every LangChain, LangGraph, or LlamaIndex run, showing the full input, output, and intermediate steps for each chain node and LLM call. Beyond tracing, it offers dataset-based evaluation, LLM-as-judge scoring, prompt versioning through the LangChain Hub, and production monitoring dashboards. It's available as a cloud service with a free developer tier.

Is LangSmith free?

LangSmith has a free Developer plan that includes 5,000 traces per month and most core features. This is enough for active development and small production workloads. The Plus plan at $39 per month raises the trace limit to 50,000 and adds longer data retention. Teams doing high-volume production monitoring typically move to the Teams plan at $299 per month. Self-hosting is not available on free or Plus plans and requires an enterprise agreement.

How does LangSmith compare to Langfuse?

LangSmith has tighter integration with the LangChain ecosystem and a more polished UI for teams already using LangChain primitives. Langfuse is framework-agnostic, MIT-licensed, and self-hostable without an enterprise agreement. If your stack is LangChain or LangGraph, LangSmith's zero-config tracing gives it a clear edge in developer experience. If you're using other frameworks, need self-hosting for data residency reasons, or want predictable flat-rate pricing, Langfuse is the better fit.

Can I use LangSmith without LangChain?

Yes, though it requires more work. LangSmith provides Python and TypeScript SDKs that let you create traces and spans manually from any application. The zero-config automatic tracing only works with LangChain, LangGraph, and a handful of other supported frameworks. For custom stacks or direct API calls, you instrument your code using the SDK's context managers or decorators, similar to how you'd use Langfuse or any other observability SDK.

What is the LangChain Hub?

The LangChain Hub is LangSmith's prompt registry. You can store prompt templates in the Hub, version them, and pull the active version into your LangChain application at runtime with a single function call. This decouples prompt iteration from code deployments and lets you roll back a bad prompt without a new release. The Hub also serves as a public repository where the community shares prompt templates, though most teams use it for private team prompts rather than public sharing.