Agentbrisk

How to Monitor AI Agents in Production: Logging, Tracing, and Alerting

April 15, 2026 · Editorial Team · 9 min read · observability, production, monitoring

Shipping an AI agent to production without monitoring is like deploying a web service with no error logs and no uptime alerts. It works fine in the demo. In production, something will go wrong (a model hallucination, an unexpected tool failure, a slow API response, a cost spike from a runaway session), and you will have no idea what happened or why.

This guide covers what you actually need to monitor in production AI agents, how to set it up, and which tools handle which parts well. I'll reference Langfuse, LangSmith, Helicone, and OpenTelemetry because those are the tools I see used most in real production systems. The patterns apply broadly regardless of which you pick.


What is different about monitoring AI agents

Standard application monitoring (APM) tools like Datadog and New Relic track latency, error rates, and throughput. Those metrics still matter for AI agents, but they miss most of what actually goes wrong.

The distinctive failure modes in AI agents are:

Quality drift. The model's outputs degrade slowly: not catastrophically wrong, just subtly worse. You do not notice until a user complains. Latency and error rate metrics tell you nothing about this.

Context accumulation. An agent session gets increasingly expensive as context grows. A session that cost $0.50 at 10 minutes costs $5 at 2 hours because every subsequent call includes the full accumulated context.

Tool failure cascades. An agent calls a tool that returns an unexpected format. The agent interprets the malformed output incorrectly, takes a wrong action, and by the time you see a failure, it's three steps downstream from the actual cause.

Silent model errors. The model produces output that parses correctly but is semantically wrong. No exception is raised. No 500 error is logged. The agent just gives the user a confident, incorrect answer.

These require observability primitives that standard APM tools were not built for: trace-level inspection of model inputs and outputs, quality evaluation against reference answers, session-level cost tracking, and alerting on semantic correctness rather than just HTTP errors.
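To make the context accumulation point concrete, here is a rough arithmetic sketch. It assumes every call resends the full history, each turn adds a fixed number of tokens, and nothing is cached or compacted. The function names and the default rate of $0.003 per 1K input tokens are illustrative assumptions, not measurements from a real session.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int, system_tokens: int = 0) -> int:
    """Total input tokens billed when every call resends the whole history."""
    total = 0
    context = system_tokens
    for _ in range(turns):
        context += tokens_per_turn  # new user/tool content added this turn
        total += context            # the full context is sent as input again
    return total


def input_cost_usd(turns: int, tokens_per_turn: int, usd_per_1k_input: float = 0.003) -> float:
    """Input-side cost only; the rate is an assumed placeholder."""
    return cumulative_input_tokens(turns, tokens_per_turn) / 1000 * usd_per_1k_input


short_session = cumulative_input_tokens(turns=10, tokens_per_turn=1000)
long_session = cumulative_input_tokens(turns=100, tokens_per_turn=1000)
print(short_session, long_session)
```

Under these assumptions, ten times as many turns bills roughly ninety times as many input tokens, which is why per-call metrics alone hide the problem and session-level totals expose it.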


Layer 1: Structured logging of inputs and outputs

The first thing to instrument is the LLM call itself. Every call to a model should log at minimum:

  • The full prompt (system prompt + messages)
  • The full response
  • Token counts (input, output, total)
  • Model name and version
  • Latency
  • Any errors or stop reasons (stop, max_tokens, tool_use)
  • A session or trace ID to group related calls

This sounds obvious, but a surprising number of production agents log only the final user-facing output, not the intermediate model calls. When you are debugging an agent that took a wrong turn on step 3 of a 10-step task, having only the step-10 output is useless.

Here is the minimal logging pattern in Python:

import time
import uuid

def call_model_with_logging(client, messages, model, system_prompt, logger, trace_id=None):
    # Reuse the caller's trace_id so every call in a session shares one ID;
    # generate a fresh one only when none is supplied
    trace_id = trace_id or str(uuid.uuid4())
    start = time.perf_counter()

    try:
        response = client.messages.create(
            model=model,
            max_tokens=4096,
            system=system_prompt,
            messages=messages
        )

        # The first content block is not always text (it can be a tool_use block),
        # so collect only the text blocks for the preview
        text = "".join(b.text for b in response.content if b.type == "text")

        logger.info({
            "trace_id": trace_id,
            "model": model,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "latency_ms": int((time.perf_counter() - start) * 1000),
            "stop_reason": response.stop_reason,
            "prompt_preview": str(messages[-1])[:500],  # truncate for log size
            "response_preview": text[:500],
        })

        return response, trace_id

    except Exception as e:
        logger.error({
            "trace_id": trace_id,
            "model": model,
            "error_type": type(e).__name__,
            "error": str(e),
            "latency_ms": int((time.perf_counter() - start) * 1000),
        })
        raise

Two things to be careful about here. First, full prompt logging can be expensive in terms of log storage if you have long system prompts. Log a hash or a truncated preview unless you specifically need the full text for debugging. Second, if you are logging to a third-party service, make sure your prompts do not contain PII that should not leave your infrastructure.
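Both cautions can be handled before the record reaches the logger. The sketch below is one possible approach, not a complete PII solution: it fingerprints the system prompt with a hash and masks email addresses in the preview. The helper names and the email regex are assumptions for illustration, and real PII detection needs far more than one pattern.

```python
import hashlib
import re

# Deliberately simple email pattern; an assumption for illustration only
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def prompt_fingerprint(text: str) -> str:
    """Short, stable hash that identifies a prompt without storing its text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def loggable_prompt(system_prompt: str, last_message: str, preview_chars: int = 500) -> dict:
    """Build log fields: a hash for the long system prompt, a redacted preview for the message."""
    return {
        "system_prompt_hash": prompt_fingerprint(system_prompt),
        # Redact before truncating so a cut-off address cannot slip through
        "prompt_preview": redact_emails(last_message)[:preview_chars],
    }


fields = loggable_prompt("You are a support agent.", "Reach me at jane@example.com please")
print(fields)
```

Because the hash is stable, you can store each distinct system prompt once elsewhere and join on the fingerprint when you need the full text for debugging.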


Layer 2: Distributed tracing for multi-step agents

Logging individual LLM calls is necessary but not sufficient for agents that run multiple steps. You need to see the full execution tree: which tools were called, what they returned, how long each step took, and how the output of step 2 affected the inputs to step 3.

This is what distributed tracing is for, and it is where tools like Langfuse and LangSmith add real value.

Using Langfuse for agent tracing

Langfuse uses a hierarchical trace model. A top-level "trace" corresponds to a user request or session. Within the trace, you nest "spans" for each significant step: LLM calls, tool calls, retrieval steps, and subagent invocations.

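# Import path for the Langfuse Python SDK v2; in v3, observe is imported
# from the top-level langfuse package instead
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

langfuse = Langfuse()  # reads credentials from LANGFUSE_* environment variables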

@observe()
def run_agent_session(user_query: str):
    # Everything inside this function is traced
    context = retrieve_relevant_docs(user_query)
    result = call_agent_with_context(user_query, context)
    return result

@observe()
def retrieve_relevant_docs(query: str):
    # This nested call appears as a child span
    docs = vector_store.search(query, top_k=5)
    langfuse_context.update_current_observation(
        metadata={"doc_count": len(docs), "query": query}
    )
    return docs

The @observe() decorator captures the function name, inputs, outputs, and timing automatically. In the Langfuse UI, you see the full trace tree: the session at the top, each function call nested underneath, and every LLM call within each function with its full prompt and response.

Using LangSmith for LangGraph agents

If your agent uses LangGraph, LangSmith tracing integrates automatically. Set the environment variables and every graph execution is traced at the graph node level:

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=your_key
export LANGCHAIN_PROJECT=your_project_name

LangSmith shows you the graph execution: which nodes ran, what state was passed between them, where the graph branched, and how long each node took. For complex agents with conditional routing, this is significantly more useful than flat logs.

OpenTelemetry for infrastructure-first teams

If your engineering team already runs distributed tracing via OpenTelemetry (with Jaeger, Tempo, or a managed OTEL provider), you can use the OpenTelemetry Python SDK to add AI agent spans that live alongside your existing service traces.

The OpenInference instrumentation libraries, maintained by Arize (the team behind Phoenix), ship OTEL-compatible auto-instrumentation for OpenAI, Anthropic, and LangChain:

from openinference.instrumentation.anthropic import AnthropicInstrumentor

# Assumes a global OpenTelemetry TracerProvider with an exporter is already configured
AnthropicInstrumentor().instrument()

# Anthropic API calls made after this point are traced to your OTEL backend

This approach works well when you want AI traces to live inside your existing observability infrastructure rather than in a separate AI-specific tool.


Layer 3: Cost tracking and session budgets

Cost spikes are a real production issue. An agent session that was expected to cost $0.50 can cost $50 if the agent gets into a loop, if context accumulates without compaction, or if a tool error causes repeated retries against an expensive model.

The right response is to track cost per session and set per-session budgets.

Track session cost by summing token usage across all calls in the session:

class SessionCostTracker:
    # Approximate USD rates per 1K tokens as (input, output).
    # Verify against your provider's current pricing.
    RATES = {
        "claude-3-7-sonnet": (0.003, 0.015),
        "claude-4-opus": (0.015, 0.075),
    }
    DEFAULT_RATE = (0.003, 0.015)

    def __init__(self, budget_usd=5.0):
        self.input_tokens = 0
        self.output_tokens = 0
        self.cost_usd = 0.0
        self.budget = budget_usd

    def add_usage(self, input_tokens: int, output_tokens: int, model: str):
        # Price each call at its own model's rate so sessions that mix
        # models are not billed at a single rate
        input_rate, output_rate = self.RATES.get(model, self.DEFAULT_RATE)
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.cost_usd += (input_tokens / 1000 * input_rate) + \
                         (output_tokens / 1000 * output_rate)

    def estimated_cost_usd(self) -> float:
        return self.cost_usd

    def over_budget(self) -> bool:
        return self.estimated_cost_usd() > self.budget

When a session hits its budget, the agent can gracefully finish its current step and stop, rather than running until the user ends it. This is especially important for autonomous agents where no human is actively watching the session.
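One way to wire a budget into the agent loop is sketched below. It is self-contained and uses hypothetical names (`StepResult`, `run_with_budget`, `step_fn`). The assumption is that each step reports its own cost and whether the task is finished; a step cap is added as a second guard against loops that are cheap per step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    cost_usd: float  # cost of this step, e.g. derived from token usage
    done: bool       # True when the agent considers the task finished


def run_with_budget(step_fn: Callable[[int], StepResult],
                    budget_usd: float = 5.0,
                    max_steps: int = 50) -> dict:
    """Run steps until the task finishes, the budget is exceeded, or the step cap is hit."""
    spent = 0.0
    for step in range(1, max_steps + 1):
        result = step_fn(step)  # the current step always runs to completion
        spent += result.cost_usd
        if result.done:
            return {"status": "completed", "steps": step, "spent_usd": spent}
        if spent > budget_usd:
            # Stop before starting another step rather than aborting mid-step
            return {"status": "budget_exceeded", "steps": step, "spent_usd": spent}
    return {"status": "max_steps_reached", "steps": max_steps, "spent_usd": spent}


# A stub agent that never finishes and costs $0.75 per step
outcome = run_with_budget(lambda step: StepResult(0.75, False), budget_usd=2.0)
print(outcome)
```

Returning a status instead of raising lets the caller decide what a graceful stop means: summarize progress for the user, persist state for a later resume, or page someone.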

Helicone handles this well at the API proxy level: you can set cost budgets per user or per session in Helicone's dashboard without changing your application code.


Layer 4: Quality evaluation

Latency and cost tell you how the agent is performing operationally. They tell you nothing about whether the outputs are any good. For that, you need evaluation.

There are three practical approaches in production:

1. LLM-as-judge. After each agent response, call a cheap, fast model (such as GPT-5 nano or Claude 3.5 Haiku) with an evaluation prompt that asks whether the response answered the user's question correctly, whether it stayed on task, and whether it cited sources accurately. Log the score. Alert if the score drops below a threshold.

import json
import re

def evaluate_response(user_query: str, agent_response: str, client) -> float:
    eval_prompt = f"""Rate the following AI response on a scale of 0-10.

User question: {user_query}
AI response: {agent_response}

Criteria:
- Does it directly answer the question? (0-4 points)
- Is it factually accurate based on what you know? (0-3 points)
- Is it appropriately concise? (0-3 points)

Return only a JSON object: {{"score": <number>, "reason": "<brief explanation>"}}"""

    result = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": eval_prompt}]
    )
    # The judge may wrap the JSON in prose or a code fence,
    # so extract the outermost {...} span before parsing
    raw = result.content[0].text
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError(f"Judge returned no JSON: {raw[:200]!r}")
    return float(json.loads(match.group(0))["score"])

2. Reference-based evaluation on a test set. Maintain a dataset of 50-200 representative queries with known good answers. Run your agent against this dataset on every deployment. If the pass rate drops by more than 5 percentage points against the previous release, block the deploy and investigate.

Both Langfuse and Braintrust have built-in dataset management for this. You upload your test cases, run evaluations via their SDK, and see pass rates across deployments in a dashboard.

3. User feedback signals. Thumbs up/down buttons, explicit ratings, and correction signals from users are the most reliable quality data you can get. They are noisier than automated evaluators but they represent real user satisfaction, not a proxy metric. Pipe user feedback into your observability tool as scores on the relevant traces.
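Of the three approaches, the reference-based deploy gate is the easiest to automate in CI. Below is a minimal sketch with hypothetical names (`evaluate`, `contains_expected`, `should_block_deploy`). It reads the 5% threshold as percentage points, which is an assumption, and uses a crude substring grader where a real setup might use an LLM judge or task-specific checks.

```python
from typing import Callable


def contains_expected(answer: str, expected: str) -> bool:
    """Crude grader: pass if the expected string appears in the answer."""
    return expected.strip().lower() in answer.strip().lower()


def evaluate(dataset: list, agent_fn: Callable[[str], str],
             grade_fn: Callable[[str, str], bool] = contains_expected) -> float:
    """Run the agent over every case and return the pass rate in [0, 1]."""
    if not dataset:
        raise ValueError("dataset is empty")
    passed = sum(grade_fn(agent_fn(case["query"]), case["expected"]) for case in dataset)
    return passed / len(dataset)


def should_block_deploy(baseline_rate: float, candidate_rate: float,
                        max_drop_points: float = 5.0) -> bool:
    """Block when the pass rate fell by more than max_drop_points percentage points."""
    return (baseline_rate - candidate_rate) * 100 > max_drop_points


# Tiny illustrative dataset and a stub agent that gets one answer wrong
dataset = [
    {"query": "capital of France", "expected": "Paris"},
    {"query": "2 + 2", "expected": "4"},
    {"query": "largest planet", "expected": "Jupiter"},
    {"query": "boiling point of water in C", "expected": "100"},
]
canned = {"capital of France": "Paris.", "2 + 2": "It is 4.",
          "largest planet": "Saturn", "boiling point of water in C": "100 degrees"}
candidate_rate = evaluate(dataset, lambda q: canned[q])
print(candidate_rate, should_block_deploy(baseline_rate=1.0, candidate_rate=candidate_rate))
```

In CI, the baseline would come from the last released version's run, and a blocked deploy would fail the pipeline with the list of newly failing cases attached.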


Layer 5: Alerting

Monitoring without alerting is just expensive data storage. The alerts that actually matter for AI agents:

Cost alerts. Alert when daily spend exceeds your expected budget by more than 20%. Alert when a single session's cost exceeds a threshold (e.g., $10 for a session that should cost $0.50). These catch both gradual drift and acute runaway scenarios.

Latency alerts. Alert when median session latency or P95 latency crosses a threshold. Slow model responses are usually a model provider issue, but they can also indicate context bloat: the model takes longer to process an unusually large context.

Error rate alerts. Alert when tool call error rates exceed, say, 5% in a rolling 15-minute window. Tool errors cascade in ways that simple model errors do not.

Quality alerts. Alert when your LLM-as-judge score average drops below your quality threshold. This is the alert most teams forget to set up, and it is the most important one for catching gradual quality degradation.

Most observability platforms (Langfuse, LangSmith, Helicone) have built-in alerting or integrate with PagerDuty and Slack. For OpenTelemetry setups, the alerts live in your existing observability platform alongside your other service alerts, which is one of the main advantages of the OTEL approach.
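If you end up evaluating these rules yourself rather than in a platform, the logic is small. The sketch below shows a rolling-window tool error rate and a combined check using only the standard library. The class and function names are hypothetical, the 5% error rate and 20% spend overage come from the examples above, and the latency and judge-score thresholds are placeholder assumptions.

```python
import statistics
from collections import deque


class RollingErrorRate:
    """Tool-call error rate over a sliding time window (default 15 minutes)."""

    def __init__(self, window_seconds: float = 900):
        self.window = window_seconds
        self.events = deque()  # (timestamp, is_error), oldest first

    def record(self, timestamp: float, is_error: bool) -> None:
        self.events.append((timestamp, is_error))

    def rate(self, now: float) -> float:
        # Drop events that have aged out of the window
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        if not self.events:
            return 0.0
        return sum(1 for _, err in self.events if err) / len(self.events)


def p95(values: list) -> float:
    """95th percentile: the last of the 19 cut points that split data into 20 groups."""
    return statistics.quantiles(values, n=20)[-1]


def check_alerts(*, tool_error_rate, latencies_ms, judge_scores,
                 daily_spend_usd, expected_daily_spend_usd, thresholds) -> list:
    """Return the names of every alert rule that currently fires."""
    alerts = []
    if tool_error_rate > thresholds["tool_error_rate"]:
        alerts.append("tool_error_rate")
    if len(latencies_ms) >= 2 and p95(latencies_ms) > thresholds["p95_latency_ms"]:
        alerts.append("p95_latency")
    if judge_scores and statistics.mean(judge_scores) < thresholds["min_judge_score"]:
        alerts.append("quality")
    if daily_spend_usd > expected_daily_spend_usd * 1.2:  # more than 20% over budget
        alerts.append("daily_cost")
    return alerts


thresholds = {"tool_error_rate": 0.05, "p95_latency_ms": 8000, "min_judge_score": 7.0}
fired = check_alerts(tool_error_rate=0.10, latencies_ms=[100] * 20,
                     judge_scores=[8, 9, 4, 5], daily_spend_usd=130,
                     expected_daily_spend_usd=100, thresholds=thresholds)
print(fired)
```

The returned list can be handed to whatever notifier you already use; keeping rule evaluation separate from delivery makes the thresholds easy to test.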


A practical monitoring stack

For most production agents, the monitoring stack I'd recommend:

Tracing: Langfuse (if you want open-source flexibility or self-hosting) or LangSmith (if you are on LangGraph). Either gives you the trace depth you need.

Cost tracking: Langfuse has this built in. If you need it separately, Helicone's proxy is the fastest path.

Quality evaluation: Start with LLM-as-judge on live traffic and add a reference dataset as you accumulate good test cases. Langfuse and LangSmith both support this.

Alerting: Wire your observability tool to Slack at minimum. Add PagerDuty or equivalent for anything where quality or cost alerts need after-hours response.

This is not an expensive stack. Langfuse's free tier covers early-stage production. LangSmith's Plus plan at $39 per seat per month handles moderate volume. The cost of not monitoring (a runaway session, a quality regression you discover from user complaints, a model failure you cannot reproduce because you have no logs) is typically much higher.

Start simple. Instrument every LLM call with inputs, outputs, and token counts. Add session cost tracking. Add one quality evaluator. Then grow the sophistication as you learn what questions you cannot answer from the data you have.
