TypeScript · Apache-2.0 · observability, gateway, evaluation

Helicone

Open-source LLM observability proxy that logs every API call with zero code changes

Helicone is an open-source LLM observability proxy that sits between your application and LLM providers. Change one URL and get logging of every API call with cost tracking, latency, and full request/response storage. It is available as a managed cloud service or self-hosted under Apache-2.0, and it adds caching, rate limiting, and basic prompt management on top of the proxy.

Most observability tools ask you to instrument your code. Add a decorator here, wrap a function there, configure a callback handler, deal with SDK dependencies. Helicone takes the opposite approach: change one URL and get instant logging for every LLM call your application makes, with zero other changes.

That single design choice explains both why Helicone is popular and where its limits are. The proxy model is the fastest path to basic LLM observability. It's also inherently limited to what a proxy can see: individual API calls, not the orchestration logic that produced them.

What Helicone is

Helicone is an open-source LLM observability proxy. The company was founded in 2023 and the repository at Helicone/helicone has crossed 10,000 GitHub stars. It's available as a managed cloud service and as a self-hosted deployment under the Apache-2.0 license.

The core product is simple: instead of calling api.openai.com, you call oai.helicone.ai. Helicone intercepts the request, logs it, forwards it to OpenAI, and returns the response. You get a dashboard showing every API call with timing, token counts, calculated cost, and the full prompt and completion stored for inspection.
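To make the intercept, log, forward, return loop concrete, here is a minimal sketch of what a logging proxy records around one call. It is not Helicone's implementation: `logged_call`, `CallRecord`, the stub provider, and the per-token prices are hypothetical and chosen only for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Hypothetical prices in USD per 1M tokens; real prices vary by model and change over time.
PRICE_PER_MILLION = {"input": 0.15, "output": 0.60}


@dataclass
class CallRecord:
    model: str
    latency_s: float
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    request: dict
    response: dict


def logged_call(request: dict, forward: Callable[[dict], dict], log: list) -> dict:
    """Forward a request unchanged, then record timing, tokens, and cost."""
    start = time.perf_counter()
    response = forward(request)  # the upstream provider call
    latency = time.perf_counter() - start

    usage = response["usage"]
    cost = (
        usage["prompt_tokens"] * PRICE_PER_MILLION["input"]
        + usage["completion_tokens"] * PRICE_PER_MILLION["output"]
    ) / 1_000_000
    log.append(CallRecord(request["model"], latency, usage["prompt_tokens"],
                          usage["completion_tokens"], cost, request, response))
    return response  # the caller sees exactly what the provider returned


def fake_provider(request: dict) -> dict:
    """Stand-in for the upstream API so the sketch runs offline."""
    return {"choices": [{"message": {"content": "100 °C at sea level"}}],
            "usage": {"prompt_tokens": 1000, "completion_tokens": 500}}


log: list = []
reply = logged_call({"model": "gpt-4o-mini", "messages": []}, fake_provider, log)
```

The application code never touches `log`; everything the dashboard shows is derived from what passes through the proxy.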

Beyond the basic proxy, Helicone has added:

Caching to avoid re-running identical prompts
Rate limiting to control usage per user, session, or API key
Custom properties for filtering and segmenting your logs
Prompt versioning for managing prompt templates outside your codebase
Gateway mode for load balancing across providers and failing over when one is down
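The failover idea behind gateway mode can be sketched in a few lines. This is an illustration of the general pattern, not Helicone's routing logic; `call_with_fallback`, `ProviderError`, and the two stub providers are hypothetical names.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a provider callable when the upstream call fails."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, response) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


def primary(prompt: str) -> str:
    raise ProviderError("503 from upstream")  # simulate an outage


def secondary(prompt: str) -> str:
    return f"echo: {prompt}"


used, answer = call_with_fallback("hello", [("primary", primary), ("secondary", secondary)])
```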

The proxy integration

For OpenAI, the change is a new base URL plus one authentication header:

from openai import OpenAI

client = OpenAI(
    api_key="your-openai-api-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-api-key",
    }
)

# Everything else is identical to the standard OpenAI SDK
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the boiling point of water?"}]
)

For Anthropic:

import anthropic

client = anthropic.Anthropic(
    api_key="your-anthropic-api-key",
    base_url="https://anthropic.helicone.ai",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-api-key",
    }
)

If you're using HTTP directly, a reverse proxy, or any tool that makes LLM API calls through configurable endpoints, you change the base URL in your configuration and add the Helicone-Auth header. That's the entire integration. No new imports, no decorator, no SDK wrapper.
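For the plain-HTTP case, a minimal sketch using only the standard library might look like this. It assumes the standard OpenAI chat completions path under the base URL shown above; `build_chat_request` is a hypothetical helper, and the request is built but not sent so the example runs without credentials.

```python
import json
import urllib.request

# Placeholder credentials, as in the SDK examples above.
OPENAI_API_KEY = "your-openai-api-key"
HELICONE_API_KEY = "your-helicone-api-key"


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request routed through Helicone."""
    body = {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url="https://oai.helicone.ai/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "Helicone-Auth": f"Bearer {HELICONE_API_KEY}",
        },
        method="POST",
    )


request = build_chat_request("What is the boiling point of water?")
# To send it: urllib.request.urlopen(request)
```

Compared with a direct call to the provider, only the host and the extra `Helicone-Auth` header differ.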

Custom properties for filtering

Raw request logs get noisy quickly. Custom properties let you attach metadata to requests so you can filter and segment them in the dashboard:

client = OpenAI(
    api_key="your-openai-api-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-api-key",
        "Helicone-Property-User-ID": "user-12345",
        "Helicone-Property-Feature": "search",
        "Helicone-Property-Environment": "production",
    }
)

These headers pass through to Helicone's logging layer without affecting the API response. In the dashboard, you can filter requests by any property: show me all requests from user-12345, show me all search feature requests, compare production versus staging. For multi-tenant applications, tagging requests by user ID is especially useful for understanding which users drive cost.
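Headers set in `default_headers` apply to every request the client makes, while values such as the user ID usually change per request. One way to handle that, sketched here, is a small helper that turns a dictionary into the `Helicone-Property-*` headers shown above. `helicone_properties` is a hypothetical helper, not part of any Helicone SDK; the commented usage relies on the `extra_headers` argument that the OpenAI Python SDK (v1 and later) accepts on individual calls.

```python
def helicone_properties(properties: dict[str, str]) -> dict[str, str]:
    """Map {"Feature": "search"} to {"Helicone-Property-Feature": "search"}."""
    headers = {}
    for name, value in properties.items():
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid property name: {name!r}")
        headers[f"Helicone-Property-{name}"] = str(value)
    return headers


per_request_headers = helicone_properties({
    "User-ID": "user-12345",
    "Feature": "search",
    "Environment": "production",
})

# With a client configured as above, the headers can be attached to one call:
# client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": "..."}],
#     extra_headers=per_request_headers,
# )
```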

Caching

Helicone's cache layer can return a stored response for a repeated request without making an API call:

client = OpenAI(
    api_key="your-openai-api-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-api-key",
        "Helicone-Cache-Enabled": "true",
        "Helicone-Cache-Max-Age": "3600",  # cache for 1 hour
    }
)

Exact-match caching is straightforward: the same request returns the stored response from cache. Helicone also supports cache buckets, which keep several stored responses for the same request and return one of them on a hit. That adds variety to cached answers, though it requires configuring the bucket size and makes results less predictable.
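Exact-match caching can be pictured as a lookup keyed on a hash of the request body, with entries expiring after the max age. The following is a minimal in-memory sketch of that idea, not Helicone's implementation; `ExactMatchCache` and its methods are hypothetical names.

```python
import hashlib
import json
import time


class ExactMatchCache:
    def __init__(self, max_age_s: float, clock=time.monotonic):
        self.max_age_s = max_age_s
        self.clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    @staticmethod
    def key(request: dict) -> str:
        # Canonical JSON so key order in the dict does not change the hash.
        canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, request: dict):
        entry = self._entries.get(self.key(request))
        if entry is None:
            return None
        stored_at, response = entry
        if self.clock() - stored_at > self.max_age_s:
            return None  # expired
        return response

    def put(self, request: dict, response: dict) -> None:
        self._entries[self.key(request)] = (self.clock(), response)


cache = ExactMatchCache(max_age_s=3600)
req = {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hi"}]}
cache.put(req, {"content": "Hello!"})
hit = cache.get(dict(reversed(list(req.items()))))  # same request, different key order
miss = cache.get({**req, "messages": [{"role": "user", "content": "Hi!"}]})
```

Note that a single extra character in the prompt produces a different key, which is why exact-match caching only pays off when requests truly repeat.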

For applications where users frequently ask the same questions, caching can cut LLM costs significantly. An FAQ bot where 30% of questions exactly repeat an earlier one can save up to roughly 30% in API costs, assuming repeated and new questions cost about the same per call. For creative applications where every request should produce a fresh response, caching is the wrong tool.
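The arithmetic behind the FAQ-bot estimate is simple enough to write down. This sketch assumes cache hits cost nothing at the provider and that hits and misses have the same average cost per call; the request volume and per-call cost are made-up numbers.

```python
def monthly_cost_with_cache(requests: int, avg_cost_usd: float, hit_rate: float) -> float:
    """Provider cost when a fraction `hit_rate` of requests is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return requests * (1.0 - hit_rate) * avg_cost_usd


# Hypothetical workload: 100,000 requests per month at $0.002 each.
baseline = monthly_cost_with_cache(100_000, 0.002, hit_rate=0.0)
cached = monthly_cost_with_cache(100_000, 0.002, hit_rate=0.30)
savings_fraction = 1 - cached / baseline
```

Under these assumptions the savings fraction equals the hit rate. In practice the real hit rate is lower than the share of "similar" questions, because only exact repeats within the max age count as hits.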

Rate limiting

You can set per-user rate limits with a request header:

client = OpenAI(
    api_key="your-openai-api-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-api-key",
        "Helicone-User-Id": "user-12345",  # identifies the user the limit applies to
        "Helicone-RateLimit-Policy": "100;w=3600;s=user",  # 100 requests per hour, per user
    }
)

When a user hits their limit, Helicone returns a 429 before forwarding the request to the provider, so no LLM cost is incurred. For multi-tenant apps where a small number of heavy users could consume a disproportionate share of your LLM budget, this is a practical guardrail that takes minutes to add.
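The behavior described above can be sketched as a per-user counter checked before the request is forwarded. This is an illustration only: it parses just the quota and `w` fields of the policy string, and it assumes a fixed-window counter, which may not match how Helicone actually counts. `parse_policy` and `FixedWindowLimiter` are hypothetical names.

```python
from collections import defaultdict


def parse_policy(policy: str) -> tuple[int, int]:
    """Parse "100;w=3600" into (quota, window_seconds). Other fields are ignored here."""
    quota_part, *params = policy.split(";")
    fields = dict(p.split("=", 1) for p in params)
    return int(quota_part), int(fields["w"])


class FixedWindowLimiter:
    def __init__(self, policy: str):
        self.quota, self.window_s = parse_policy(policy)
        self._counts: dict[tuple[str, int], int] = defaultdict(int)

    def check(self, user_id: str, now_s: float) -> int:
        """Return 200 if the request may be forwarded, 429 if the user is over quota."""
        window = int(now_s // self.window_s)
        if self._counts[(user_id, window)] >= self.quota:
            return 429  # rejected before any provider call is made
        self._counts[(user_id, window)] += 1
        return 200


limiter = FixedWindowLimiter("2;w=3600")
statuses = [limiter.check("user-12345", now_s=t) for t in (0, 10, 20, 3600)]
```

The third request is rejected inside the first hour, and the fourth succeeds because it falls in the next window.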

Prompt versioning

Helicone's Prompt SDK lets you manage prompt templates outside your application code:

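import asyncio

# openai.ChatCompletion.acreate is the pre-1.0 OpenAI SDK interface,
# exposed here through Helicone's Python wrapper
from helicone.openai_async import openai, Meta


async def main():
    # await is only valid inside a coroutine, so the call lives in main()
    response = await openai.ChatCompletion.acreate(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "helicone-prompt-id": "product-description",
                        "text": "Write a product description for: {product_name}",
                        "type": "text"
                    }
                ]
            }
        ],
        helicone_meta=Meta(
            prompt_id="product-description",
            prompt_version="v2"
        )
    )
    return response


asyncio.run(main())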

The prompt management feature stores template versions in Helicone and tracks which version produced which responses. It's simpler than LangSmith's Hub or Langfuse's prompt registry, but functional for teams that need basic prompt versioning without a separate tool.
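Conceptually, prompt versioning comes down to two things: an append-only list of templates per prompt ID, and a log linking each response to the version that produced it. The in-memory sketch below illustrates that data model. It is not the Helicone Prompt SDK, and `PromptRegistry` and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """In-memory stand-in for a prompt store: versions per prompt ID plus a usage log."""
    versions: dict[str, list[str]] = field(default_factory=dict)
    usage: list[tuple[str, int, str]] = field(default_factory=list)

    def publish(self, prompt_id: str, template: str) -> int:
        """Store a new template version and return its 1-based version number."""
        self.versions.setdefault(prompt_id, []).append(template)
        return len(self.versions[prompt_id])

    def render(self, prompt_id: str, version: int, **inputs: str) -> str:
        """Fill the chosen template version with the given inputs."""
        return self.versions[prompt_id][version - 1].format(**inputs)

    def record(self, prompt_id: str, version: int, response: str) -> None:
        """Remember which version produced which response."""
        self.usage.append((prompt_id, version, response))


registry = PromptRegistry()
registry.publish("product-description", "Describe: {product_name}")
v2 = registry.publish("product-description", "Write a product description for: {product_name}")
prompt = registry.render("product-description", v2, product_name="trail running shoes")
registry.record("product-description", v2, "Lightweight shoes built for rough terrain.")
```

Because old versions are never overwritten, a bad response can always be traced back to the exact template text that produced it.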

What Helicone doesn't cover

The proxy architecture is both Helicone's main advantage and its hard limit. Helicone sees what goes in and out of individual LLM API calls. It doesn't see the orchestration layer.

If you're running a LangGraph agent that calls the model three times while working through tool results, Helicone shows you three separate LLM API calls. It doesn't know they're part of the same agent run, it doesn't know which tool results influenced which call, and it can't show you the chain of reasoning that connected them.

For debugging agent behavior, this is a significant limitation. When your agent produces a wrong answer, knowing that the third LLM call had a 2.1s latency doesn't tell you why the agent's reasoning went sideways. That requires span-level tracing of the orchestration logic, which is what Langfuse and LangSmith provide.
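The difference shows up in the shape of the data. In this sketch, with hypothetical field names and values, proxy-level records are a flat list of independent calls, while orchestration-level spans carry a trace ID and a parent ID that let a tool rebuild the run.

```python
from collections import defaultdict

# What an HTTP-level proxy can record: independent calls, no link between them.
proxy_log = [
    {"request_id": "r1", "model": "gpt-4o-mini", "latency_s": 0.9},
    {"request_id": "r2", "model": "gpt-4o-mini", "latency_s": 1.4},
    {"request_id": "r3", "model": "gpt-4o-mini", "latency_s": 2.1},
]

# What instrumented code can record: spans that name their run and their parent step.
spans = [
    {"span_id": "s0", "trace_id": "run-42", "parent_id": None, "name": "agent_run"},
    {"span_id": "s1", "trace_id": "run-42", "parent_id": "s0", "name": "llm_call"},
    {"span_id": "s2", "trace_id": "run-42", "parent_id": "s0", "name": "tool:search"},
    {"span_id": "s3", "trace_id": "run-42", "parent_id": "s0", "name": "llm_call"},
]


def children_by_parent(records: list[dict]) -> dict:
    """Group span names under their parent span ID to rebuild the call tree."""
    tree = defaultdict(list)
    for record in records:
        tree[record["parent_id"]].append(record["name"])
    return dict(tree)


tree = children_by_parent(spans)
linkable = all("trace_id" in record for record in proxy_log)
```

The span list can be folded back into a tree rooted at `agent_run`; the proxy log has no field to group on, so the same reconstruction is not possible from it.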

Helicone is the right tool when you want to understand your LLM API usage in aggregate: what is this costing, which models are slow, which users are heavy consumers, where are the errors. It's not the right tool for understanding why a specific agent run produced a bad result.

Helicone vs the alternatives

Helicone vs Langfuse

Different tools for overlapping problems. Helicone wins on setup time and on gateway features like caching and rate limiting. Langfuse wins on agent-level tracing depth and on evaluation tooling. Teams often use both: Helicone at the API level for cost monitoring and caching, Langfuse for debugging specific agent runs. If you can only pick one and you run complex agents, pick Langfuse. If you want quick cost visibility across a simple application, Helicone is faster.

Helicone vs LangSmith

LangSmith is designed for LangChain teams and provides deep chain-level tracing. Helicone is framework-agnostic and works at the HTTP level. For teams not using LangChain, Helicone's proxy is simpler to set up and has no framework dependency. For LangChain teams, LangSmith's automatic tracing is far more informative than Helicone's API-level logs.

Helicone vs Arize Phoenix

Arize Phoenix is a deeper evaluation and tracing platform with OpenTelemetry-based span tracing. Phoenix wins on evaluation capabilities and agent debugging depth. Helicone wins on zero-configuration setup and on gateway features Phoenix doesn't offer. These products don't really compete for the same primary use case.

Pricing in practice

The managed cloud service has a free plan that includes 10,000 requests per month. That's enough for development and light production use. The Growth plan at $20/month adds 100,000 requests and longer log retention. The Pro plan at $200/month targets small to medium production workloads.

At scale, Helicone's per-request pricing can compound. Teams running millions of LLM calls per month often find self-hosting more economical. The self-hosted stack uses ClickHouse for storing requests and Supabase for metadata, which requires more operational setup than Langfuse's simpler Postgres-based deployment but handles high request volumes efficiently.

For a team monitoring an application that makes on the order of a hundred thousand LLM calls per month, the Growth plan at $20/month is reasonable.

Who should use Helicone

Helicone makes the most sense for specific situations:

Teams adding observability to an existing application quickly. If you have a running LLM application and you want cost visibility and request logging today, Helicone's proxy integration is the fastest path. Change a URL and you're done.

Applications using direct LLM API calls without an agent framework. If you're calling OpenAI or Anthropic directly (from a chatbot, a content generation tool, or a simple classification endpoint), Helicone gives you complete request logging with minimal overhead.

Multi-tenant applications that need per-user rate limiting and cost tracking. The custom properties and rate limiting features are genuinely useful here and require no instrumentation beyond the base URL change.

Teams who want caching as part of their observability setup. The fact that Helicone combines logging with caching and rate limiting in the same proxy is convenient. Adding a separate caching layer would require another tool.

Helicone is harder to recommend when you're building complex agents that need orchestration-level debugging, when you're already using LangChain and LangSmith is the obvious fit, or when you need sophisticated evaluation workflows.

The verdict

Helicone does what it says. The proxy model delivers instant observability with an integration that takes under five minutes. The cost tracking is accurate, the dashboard is clear, and the caching and rate limiting features are genuinely useful additions that other observability tools don't offer in the same package.

The limit is real: if you need to understand agent behavior at the orchestration level, you'll outgrow Helicone. It's a window into your LLM API usage, not a window into your agent's reasoning. Many teams use it alongside a deeper tracing tool for exactly that reason.

For simple applications, Helicone is a near-perfect fit. For complex agent systems, treat it as one layer in a larger observability stack rather than the complete solution.

Key features

Proxy-based observability with a single base URL change, no SDK required
Real-time cost and token tracking across OpenAI, Anthropic, Azure, and 30+ providers
Request logging with full prompt and completion storage
Caching layer to reduce duplicate LLM calls and cut costs
Rate limiting and usage controls per user or API key
Custom properties for filtering and segmenting requests
Prompt versioning and testing via the Helicone Prompt SDK
Gateway mode with load balancing and provider fallback

Frequently Asked Questions

What is Helicone?

Helicone is an open-source LLM observability proxy. Instead of calling the OpenAI or Anthropic API directly, you route your requests through Helicone's endpoint. Helicone logs the request and response, calculates cost and latency, and forwards the call to the original provider. The result is instant observability with no SDK changes and no code instrumentation. Beyond logging, Helicone adds caching, rate limiting, and basic prompt management on top of the proxy layer.

Does Helicone add latency?

Yes, though it's typically small. In the managed cloud tier, Helicone adds roughly 30-80ms per request in most regions. For interactive applications where users are waiting for a streaming response, this overhead is usually unnoticeable. For latency-sensitive batch workloads it can matter. If latency is critical, Helicone's self-hosted option cuts most of that overhead, since the proxy runs inside your own infrastructure instead of a separate network hop away.
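Whether that overhead matters depends on how long the provider call itself takes. A quick way to check is to compute the proxy's share of end-to-end latency, as in this sketch. The two provider latencies are illustrative values, not measurements, and `overhead_share` is a hypothetical helper.

```python
def overhead_share(provider_latency_s: float, proxy_overhead_ms: float) -> float:
    """Fraction of end-to-end latency contributed by the proxy hop."""
    overhead_s = proxy_overhead_ms / 1000.0
    return overhead_s / (provider_latency_s + overhead_s)


# The 30-80 ms range quoted above, applied to a fast 0.4 s call and a slow 2.1 s call.
for provider_latency_s in (0.4, 2.1):
    low = overhead_share(provider_latency_s, 30)
    high = overhead_share(provider_latency_s, 80)
    print(f"{provider_latency_s:.1f}s call: {low:.1%} to {high:.1%} of total latency")
```

The share shrinks as the underlying call gets slower, so the proxy hop is most visible on short, fast completions.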

How does Helicone compare to Langfuse?

Helicone and Langfuse solve different parts of the observability problem. Helicone is a proxy: it intercepts API calls and logs them at the HTTP level. You see individual LLM calls with their inputs and outputs. Langfuse is an SDK-based tracer: you instrument your application code and get span-level visibility into agent orchestration, chains, and multi-step workflows. For basic cost and usage monitoring, Helicone is easier to set up. For debugging complex agent behavior, Langfuse's span-level tracing gives you more to work with.

Can I self-host Helicone?

Yes. Helicone is Apache-2.0 licensed and designed for self-hosting. The self-hosted stack runs on Docker Compose and uses ClickHouse for request storage and Supabase for metadata. The repository at github.com/Helicone/helicone has setup documentation for local and production deployments. Self-hosting removes the per-request pricing and keeps all request data on your infrastructure, which matters for teams with data residency requirements.

Does Helicone work with providers other than OpenAI?

Yes. Helicone supports OpenAI, Anthropic, Azure OpenAI, Google Gemini, AWS Bedrock, Cohere, and any provider with an OpenAI-compatible API. For providers without a dedicated proxy endpoint, Helicone offers a gateway mode where you configure the target provider URL. Most LLM providers and local inference servers like Ollama work through this gateway mode.