Helicone
Open-source LLM observability proxy that logs every API call with zero code changes
Helicone is an open-source LLM observability proxy that sits between your application and LLM providers. Change one URL, get instant logging of every API call with cost tracking, latency, and full request/response storage. Available as a managed cloud service or self-hosted under Apache-2.0. Also adds caching, rate limiting, and basic prompt management on top of the proxy.
Most observability tools ask you to instrument your code. Add a decorator here, wrap a function there, configure a callback handler, deal with SDK dependencies. Helicone takes the opposite approach: change one URL and get instant logging for every LLM call your application makes, with zero other changes.
That single design choice explains both why Helicone is popular and where its limits are. The proxy model is the fastest path to basic LLM observability. It's also inherently limited to what a proxy can see: individual API calls, not the orchestration logic that produced them.
What Helicone is
Helicone is an open-source LLM observability proxy. The company was founded in 2023 and the repository at Helicone/helicone has crossed 10,000 GitHub stars. It's available as a managed cloud service and as a self-hosted deployment under the Apache-2.0 license.
The core product is simple: instead of calling api.openai.com, you call oai.helicone.ai. Helicone intercepts the request, logs it, forwards it to OpenAI, and returns the response. You get a dashboard showing every API call with timing, token counts, calculated cost, and the full prompt and completion stored for inspection.
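To make the intercept, log, forward, return loop concrete, here is a minimal offline sketch of what a logging proxy records per call. It is not Helicone's implementation: `logged_call`, `fake_provider`, and the price table are hypothetical names and numbers chosen for illustration.

```python
import time
from typing import Callable

# Hypothetical USD prices per 1,000 tokens; real prices differ by model and change over time.
PRICES_PER_1K = {"gpt-4o-mini": {"prompt": 0.00015, "completion": 0.0006}}


def logged_call(forward: Callable[[dict], dict], request: dict, log: list) -> dict:
    """Forward a request upstream and append one log record, as a proxy would."""
    started = time.perf_counter()
    response = forward(request)  # the upstream call the application actually wanted
    latency_s = time.perf_counter() - started

    usage = response["usage"]
    price = PRICES_PER_1K[request["model"]]
    cost_usd = (usage["prompt_tokens"] * price["prompt"]
                + usage["completion_tokens"] * price["completion"]) / 1000

    log.append({
        "model": request["model"],
        "latency_s": latency_s,
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "cost_usd": cost_usd,
        "request": request,    # full prompt stored for inspection
        "response": response,  # full completion stored for inspection
    })
    return response  # returned unchanged to the caller


def fake_provider(request: dict) -> dict:
    """Stand-in for the upstream LLM API so the sketch runs without a network."""
    return {
        "choices": [{"message": {"role": "assistant",
                                 "content": "100 degrees Celsius at sea level."}}],
        "usage": {"prompt_tokens": 1000, "completion_tokens": 500},
    }


log: list = []
request = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is the boiling point of water?"}],
}
response = logged_call(fake_provider, request, log)
```

The application only ever sees `response`; everything in `log` is collected on the side, which is why the integration needs no instrumentation in the calling code.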
Beyond the basic proxy, Helicone has added:
- Caching to avoid re-running identical prompts
- Rate limiting to control usage per user, session, or API key
- Custom properties for filtering and segmenting your logs
- Prompt versioning for managing prompt templates outside your codebase
- Gateway mode for load balancing across providers and failing over when one is down
The proxy integration
For OpenAI, the change is confined to the client constructor: a new base URL and one extra header.
from openai import OpenAI

client = OpenAI(
    api_key="your-openai-api-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-api-key",
    },
)

# Everything else is identical to the standard OpenAI SDK
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the boiling point of water?"}],
)
For Anthropic:
import anthropic

client = anthropic.Anthropic(
    api_key="your-anthropic-api-key",
    base_url="https://anthropic.helicone.ai",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-api-key",
    },
)
If you're using HTTP directly, a reverse proxy, or any tool that makes LLM API calls through a configurable endpoint, you change the base URL in your configuration and add the Helicone-Auth header. That's the entire integration. No new imports, no decorator, no SDK wrapper.
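For the plain-HTTP case, a sketch of the request you would send might look like the following. It only builds the request description and sends nothing; `build_chat_request` is a hypothetical helper, and the `/chat/completions` path is the standard OpenAI route appended to the base URL used above.

```python
import json

HELICONE_OPENAI_BASE = "https://oai.helicone.ai/v1"  # base URL from the OpenAI example


def build_chat_request(openai_key: str, helicone_key: str, payload: dict) -> dict:
    """Describe the HTTP request; hand url, headers, and body to any HTTP client."""
    return {
        "method": "POST",
        "url": f"{HELICONE_OPENAI_BASE}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {openai_key}",    # provider key, forwarded upstream
            "Helicone-Auth": f"Bearer {helicone_key}",  # identifies your Helicone account
            "Content-Type": "application/json",
        },
        "body": json.dumps(payload).encode("utf-8"),
    }


req = build_chat_request(
    "your-openai-api-key",
    "your-helicone-api-key",
    {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]},
)
```

Compared with a direct call to the provider, only the host and the one extra header differ; the body is the same payload you would send to OpenAI.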
Custom properties for filtering
Raw request logs get noisy quickly. Custom properties let you attach metadata to requests so you can filter and segment them in the dashboard:
client = OpenAI(
    api_key="your-openai-api-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-api-key",
        "Helicone-Property-User-ID": "user-12345",
        "Helicone-Property-Feature": "search",
        "Helicone-Property-Environment": "production",
    },
)
These headers pass through to Helicone's logging layer without affecting the API response. In the dashboard, you can filter requests by any property: show me all requests from user-12345, show me all search feature requests, compare production versus staging. For multi-tenant applications, tagging requests by user ID is especially useful for understanding which users drive cost.
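Setting properties on the client applies them to every call, but values such as the user ID usually change per request. One way to handle that is a small helper that turns a dictionary into property headers. `property_headers` is a hypothetical name; the `Helicone-Property-` prefix follows the header pattern shown above.

```python
def property_headers(properties: dict[str, str]) -> dict[str, str]:
    """Map {"Feature": "search"} to {"Helicone-Property-Feature": "search"}."""
    return {f"Helicone-Property-{name}": str(value) for name, value in properties.items()}


headers = property_headers({
    "User-ID": "user-12345",
    "Feature": "search",
    "Environment": "staging",
})
```

With the OpenAI Python SDK, a dictionary like `headers` could be passed on an individual call through the `extra_headers` argument of `client.chat.completions.create`, so each request carries its own tags.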
Caching
Helicone's cache layer can return a stored response for a request it has already seen, without making an API call:
client = OpenAI(
    api_key="your-openai-api-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-api-key",
        "Helicone-Cache-Enabled": "true",
        "Cache-Control": "max-age=3600",  # cache for 1 hour
    },
)
Exact match caching is straightforward: the same request returns the same response from cache. Helicone also supports cache buckets, which store several responses for the same request and return one of them on a hit, so repeated prompts still see some variation. This requires configuring a bucket size and is less predictable.
For applications where users frequently ask the same questions, caching can cut LLM costs significantly. An FAQ bot where 30% of requests are exact repeats of an earlier request can skip roughly 30% of its API calls, as long as the repeats arrive before the cached entry expires. For creative applications where every request should produce a fresh response, caching is the wrong tool.
Rate limiting
You can configure per-user rate limits with request headers:
client = OpenAI(
    api_key="your-openai-api-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-api-key",
        "Helicone-User-Id": "user-12345",
        "Helicone-RateLimit-Policy": "100;w=3600;s=user",  # 100 requests per hour, counted per user
    },
)
When a user hits their limit, Helicone returns a 429 before forwarding the request to the provider, so no LLM cost is incurred. For multi-tenant apps where a small number of heavy users could consume a disproportionate share of your LLM budget, this is a practical guardrail that takes minutes to add.
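To show how a quota-per-window policy plays out, here is a sketch of a fixed-window limiter driven by the quota and window fields of the policy string. The fixed-window algorithm is an assumption made for illustration; the source does not say how Helicone counts requests internally, and `parse_policy` and `FixedWindowLimiter` are hypothetical names.

```python
from collections import defaultdict


def parse_policy(policy: str) -> tuple[int, int]:
    """Parse '<quota>;w=<window seconds>' into (quota, window_s); other fields are ignored."""
    quota_part, *fields = policy.split(";")
    options = dict(field.split("=", 1) for field in fields)
    return int(quota_part), int(options["w"])


class FixedWindowLimiter:
    """Count requests per user within consecutive windows of fixed length."""

    def __init__(self, policy: str):
        self.quota, self.window_s = parse_policy(policy)
        self._counts: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, user_id: str, now: float) -> bool:
        """True means forward the request; False corresponds to answering with a 429."""
        window = int(now // self.window_s)
        if self._counts[(user_id, window)] >= self.quota:
            return False  # rejected before any provider call, so no LLM cost
        self._counts[(user_id, window)] += 1
        return True


limiter = FixedWindowLimiter("3;w=3600")  # small quota so the limit is easy to see
results = [limiter.allow("user-12345", now=t) for t in (0, 10, 20, 30)]
other_user = limiter.allow("user-67890", now=30)
next_window = limiter.allow("user-12345", now=3600)
```

The fourth request from `user-12345` is rejected, a different user is unaffected, and the same user is allowed again once the next window starts.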
Prompt versioning
Helicone's Prompt SDK lets you manage prompt templates outside your application code:
# This snippet uses the legacy helicone package and the pre-1.0 OpenAI interface.
# The await must run inside an async function.
from helicone.openai_async import openai, Meta

response = await openai.ChatCompletion.acreate(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "helicone-prompt-id": "product-description",
                    "text": "Write a product description for: {product_name}",
                    "type": "text",
                }
            ],
        }
    ],
    helicone_meta=Meta(
        prompt_id="product-description",
        prompt_version="v2",
    ),
)
The prompt management feature stores template versions in Helicone and tracks which version produced which responses. It's simpler than LangSmith's Hub or Langfuse's prompt registry, but functional for teams that need basic prompt versioning without a separate tool.
What Helicone doesn't cover
The proxy architecture is both Helicone's main advantage and its hard limit. Helicone sees what goes in and out of individual LLM API calls. It doesn't see the orchestration layer.
If you're running a LangGraph agent that makes three LLM calls, with tool calls in between, before it returns a final response, Helicone shows you three separate LLM API calls. It doesn't know they're part of the same agent run, it doesn't know which tool results influenced which call, and it can't show you the chain of reasoning that connected them.
For debugging agent behavior, this is a significant limitation. When your agent produces a wrong answer, knowing that the third LLM call had a 2.1s latency doesn't tell you why the agent's reasoning went sideways. That requires span-level tracing of the orchestration logic, which is what Langfuse and LangSmith provide.
Helicone is the right tool when you want to understand your LLM API usage in aggregate: what is this costing, which models are slow, which users are heavy consumers, where are the errors. It's not the right tool for understanding why a specific agent run produced a bad result.
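The aggregate questions listed above map directly onto per-request log records. As a rough sketch, assuming records exported with the hypothetical fields below (the field names and values are invented for illustration, not Helicone's export format), the summary is a few group-by operations:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log records, one per LLM API call.
records = [
    {"user": "user-12345", "model": "gpt-4o-mini", "latency_s": 0.8, "cost_usd": 0.002, "status": 200},
    {"user": "user-12345", "model": "gpt-4o", "latency_s": 2.0, "cost_usd": 0.004, "status": 200},
    {"user": "user-67890", "model": "gpt-4o-mini", "latency_s": 1.2, "cost_usd": 0.001, "status": 200},
    {"user": "user-67890", "model": "gpt-4o", "latency_s": 0.5, "cost_usd": 0.0, "status": 429},
]


def summarize(records: list[dict]) -> dict:
    """Answer: what does this cost per user, which models are slow, how often do calls fail."""
    cost_by_user: dict[str, float] = defaultdict(float)
    latencies_by_model: dict[str, list[float]] = defaultdict(list)
    errors = 0
    for record in records:
        cost_by_user[record["user"]] += record["cost_usd"]
        latencies_by_model[record["model"]].append(record["latency_s"])
        if record["status"] >= 400:
            errors += 1
    return {
        "cost_by_user": dict(cost_by_user),
        "mean_latency_by_model": {m: mean(v) for m, v in latencies_by_model.items()},
        "error_rate": errors / len(records),
    }


summary = summarize(records)
```

Nothing in these records links one call to another, which is the limitation described above: the data supports totals and averages, not a reconstruction of a single agent run.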
Helicone vs the alternatives
Helicone vs Langfuse
Different tools for overlapping problems. Helicone wins on setup time and on gateway features like caching and rate limiting. Langfuse wins on agent-level tracing depth and on evaluation tooling. Teams often use both: Helicone at the API level for cost monitoring and caching, Langfuse for debugging specific agent runs. If you can only pick one and you run complex agents, pick Langfuse. If you want quick cost visibility across a simple application, Helicone is faster.
Helicone vs LangSmith
LangSmith is designed for LangChain teams and provides deep chain-level tracing. Helicone is framework-agnostic and works at the HTTP level. For teams not using LangChain, Helicone's proxy is simpler to set up and has no framework dependency. For LangChain teams, LangSmith's automatic tracing is far more informative than Helicone's API-level logs.
Helicone vs Arize Phoenix
Arize Phoenix is a deeper evaluation and tracing platform with OpenTelemetry-based span tracing. Phoenix wins on evaluation capabilities and agent debugging depth. Helicone wins on zero-configuration setup and on gateway features Phoenix doesn't offer. These products don't really compete for the same primary use case.
Pricing in practice
The managed cloud service has a free plan that includes 10,000 requests per month. That's enough for development and light production use. The Growth plan at $20/month adds 100,000 requests and longer log retention. The Pro plan at $200/month targets small to medium production workloads.
At scale, Helicone's per-request pricing adds up. Teams running millions of LLM calls per month often find self-hosting more economical. The self-hosted stack uses ClickHouse for storing requests and Supabase for metadata, which requires more operational setup than Langfuse's simpler Postgres-based deployment but handles high request volumes efficiently.
For a team monitoring an application that makes around a hundred thousand LLM calls per month, the Growth plan at $20/month is reasonable.
Who should use Helicone
Helicone makes the most sense for specific situations:
Teams adding observability to an existing application quickly. If you have a running LLM application and you want cost visibility and request logging today, Helicone's proxy integration is the fastest path. Change a URL and you're done.
Applications using direct LLM API calls without an agent framework. If you're calling OpenAI or Anthropic directly from a chatbot, a content generation tool, or a simple classification endpoint, Helicone gives you complete request logging with minimal overhead.
Multi-tenant applications that need per-user rate limiting and cost tracking. The custom properties and rate limiting features are genuinely useful here and require no instrumentation beyond the base URL change.
Teams who want caching as part of their observability setup. The fact that Helicone combines logging with caching and rate limiting in the same proxy is convenient. Adding a separate caching layer would require another tool.
Helicone is harder to recommend when you're building complex agents that need orchestration-level debugging, when you're already using LangChain and LangSmith is the obvious fit, or when you need sophisticated evaluation workflows.
The verdict
Helicone does what it says. The proxy model delivers instant observability with an integration that takes under five minutes. The cost tracking is accurate, the dashboard is clear, and the caching and rate limiting features are genuinely useful additions that other observability tools don't offer in the same package.
The limit is real: if you need to understand agent behavior at the orchestration level, you'll outgrow Helicone. It's a window into your LLM API usage, not a window into your agent's reasoning. Many teams use it alongside a deeper tracing tool for exactly that reason.
For simple applications, Helicone is a near-perfect fit. For complex agent systems, treat it as one layer in a larger observability stack rather than the complete solution.
Key features
- Proxy-based observability with a single base URL change, no SDK required
- Real-time cost and token tracking across OpenAI, Anthropic, Azure, and 30+ providers
- Request logging with full prompt and completion storage
- Caching layer to reduce duplicate LLM calls and cut costs
- Rate limiting and usage controls per user or API key
- Custom properties for filtering and segmenting requests
- Prompt versioning and testing via the Helicone Prompt SDK
- Gateway mode with load balancing and provider fallback