AI Agent Context Windows Explained: Why They Matter in 2026
If you have ever had an AI assistant lose track of what you said three messages ago, or watched a coding agent suddenly forget that a function exists in a file it was looking at twenty minutes earlier, you have hit a context window limit. It's one of the most practically important concepts for anyone who works with AI agents, and it is also one of the most frequently misunderstood.
This guide explains what context windows are, why they matter specifically for agents (more than for chatbots), what the current sizes look like in 2026, and how the better tools handle the situation when context runs out.
What a context window actually is
Every language model has a working memory of sorts. When it processes a request, it can only "see" a certain number of tokens at once. Everything outside that window does not exist as far as the model is concerned. That limit is the context window.
Tokens are chunks of text, usually a short word or a piece of a longer one. The phrase "context window" is two or three tokens, depending on the tokenizer. A sentence is typically 10-20 tokens. A paragraph might be 100 tokens. A book is hundreds of thousands of tokens.
So a model with a 200,000 token context window can process about 150,000 words in a single pass, roughly the equivalent of a full novel. A model with a 1 million token context window can hold several books, or a large software codebase, or months of email threads.
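The conversion above is simple arithmetic, and a minimal sketch makes it easy to reuse. It assumes the ratio of about 0.75 English words per token that the 200,000-token figure implies. Real tokenizers vary by model and by content, so `WORDS_PER_TOKEN` and both function names are illustrative, not part of any library.

```python
# Rough rule of thumb implied by the figures above: 200,000 tokens ~ 150,000 words.
# This ratio is an assumption; real tokenizers vary by model, language, and content.
WORDS_PER_TOKEN = 0.75


def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def words_to_tokens(words: int) -> int:
    """Estimate how many tokens a given number of English words will use."""
    return round(words / WORDS_PER_TOKEN)


for window in (32_000, 200_000, 1_000_000):
    print(f"{window:>9,} tokens ~ {tokens_to_words(window):,} words")
```

For anything where the exact count matters, such as billing or hard limits, use the tokenizer that ships with the model you are calling rather than a ratio like this.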
The crucial thing to understand is that the context window is not just for your messages. Everything the model uses to answer goes into it: your conversation history, any documents you've attached, tool outputs, system instructions, the agent's working notes. In an agent session, context fills up from multiple directions at once.
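One way to picture "filling up from multiple directions" is a simple ledger that tracks tokens per source. The sketch below is only an illustration: the `ContextBudget` class, the source names, and every token count are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ContextBudget:
    """Tracks how many tokens each source has put into the window."""
    limit: int
    used: dict[str, int] = field(default_factory=dict)

    def add(self, source: str, tokens: int) -> None:
        """Charge tokens to a named source, creating the entry if needed."""
        self.used[source] = self.used.get(source, 0) + tokens

    @property
    def total(self) -> int:
        return sum(self.used.values())

    @property
    def remaining(self) -> int:
        return self.limit - self.total


budget = ContextBudget(limit=200_000)
budget.add("system_instructions", 3_000)   # hypothetical sizes throughout
budget.add("conversation", 12_000)
budget.add("attached_documents", 40_000)
budget.add("tool_outputs", 65_000)
budget.add("working_notes", 5_000)

print(f"used {budget.total:,} of {budget.limit:,}; {budget.remaining:,} left")
```

In this made-up session the conversation itself is under a tenth of what has been used. Tool outputs and attached documents account for most of it.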
Why context windows matter more for agents than for chatbots
A standard chatbot interaction is short. You ask a question, it answers, you ask another question. Even if the window is small, a typical conversation fits comfortably. Context limits are rarely an issue.
An agent session is completely different. When you give an agent a complex task, it might:
- Read a dozen source files to understand the codebase structure
- Run a test suite and examine the output
- Make several edits and re-run to check for regressions
- Look up documentation in multiple places
- Maintain notes about what it has tried and what failed
Each of these steps produces output that goes into the context window. A long coding session can generate tens of thousands of tokens of context without a single line of the agent's work being wasted. It's all real, necessary information, but the window fills up.
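To see how quickly those steps add up, here is a small sketch that walks a session and reports the first step that overflows a given window. The step names mirror the list above; the token sizes are invented for illustration and will differ widely in practice.

```python
from itertools import accumulate

# Hypothetical token sizes for each step of a coding session.
steps = [
    ("read 12 source files", 30_000),
    ("run test suite", 8_000),
    ("edit and re-run tests", 15_000),
    ("look up documentation", 10_000),
    ("working notes", 2_000),
]


def first_overflow(steps, limit):
    """Return the name of the first step that pushes the running total past limit, or None."""
    running = accumulate(tokens for _, tokens in steps)
    for (name, _), total in zip(steps, running):
        if total > limit:
            return name
    return None


print(first_overflow(steps, 32_000))    # a small window overflows early
print(first_overflow(steps, 200_000))   # a larger one fits the whole session
```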
What happens when the window fills up depends on the tool. The worst case: the tool silently drops the earliest context to make room for new content, and the agent loses track of its original goal, the files it already read, or the decisions it made earlier. I've seen coding agents start rewriting code they already fixed because the context of the original problem dropped out of the window.
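The worst case described above is easy to reproduce. The sketch below is a deliberately naive "drop the oldest message" policy, not the behavior of any particular tool; the function name, message texts, and token counts are all hypothetical.

```python
def truncate_oldest(messages, limit):
    """Drop messages from the front until the total token count fits within limit.

    messages is a list of (text, tokens) pairs, oldest first.
    """
    kept = list(messages)
    while kept and sum(tokens for _, tokens in kept) > limit:
        kept.pop(0)  # the oldest message goes first, including the original goal
    return kept


history = [
    ("GOAL: fix the login timeout bug", 50),
    ("contents of auth.py", 6_000),
    ("test output, first run", 3_000),
    ("contents of session.py", 5_000),
]

kept = truncate_oldest(history, limit=10_000)
print([text for text, _ in kept])
```

The 50-token goal statement is the first thing to go, even though it is the cheapest and most important item in the history. That is the failure to look for when an agent starts drifting.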
The better tools handle this more gracefully, which I'll cover in the last section.
Current context window sizes in 2026
The numbers have moved significantly over the past two years. Here is where the major models stand.
Claude (Anthropic): Claude 3.7 Sonnet has a 200,000 token context window. Claude 4 Opus extends this further, with some configurations supporting up to 1 million tokens through extended context features. Anthropic has invested heavily in this area because agentic use cases are central to their positioning, and long context is a core requirement for agentic tasks.
GPT-5 (OpenAI): GPT-5 supports around 400,000 tokens of context. This is a significant step up from GPT-4's 128K ceiling and handles most real-world agentic tasks without running out of room. The retrieval tools built into GPT-5's API can pull in additional context dynamically, which extends the effective working memory beyond the raw token limit.
Gemini 2.5 Pro (Google): Gemini's context window is the largest in production at 2 million tokens. This is not a marketing number; it is a real architectural choice. Google published research on the "needle in a haystack" problem showing that Gemini 2.5 Pro actually attends to information across the full 2M context rather than effectively ignoring the middle of very long documents, which has been a real failure mode for long-context models.
Llama 4 (Meta, open weights): Llama 4 variants support up to 1 million tokens, depending on the specific model configuration. This matters for the self-hosted and local model use cases, where developers running agents on their own infrastructure now have access to genuinely large context windows without paying API fees.
Smaller and older models: If you are working with models below the current frontier (Llama 3.x, older GPT-4 variants, Mistral models without extended context), you are typically working with 32K to 128K windows. These hit their limits quickly in complex agent sessions.
Token counting: a concrete example
To make these numbers tangible, here is what different context window sizes can hold in a software development context:
A 32K window (older models) holds roughly: a few hundred lines of code, the full test output for a medium file, and your conversation history from the last ten minutes. You can ask questions about a function or a module.
A 200K window (Claude 3.7 Sonnet) holds roughly: 10,000-15,000 lines of code, meaning a full small application. You can read an entire service's codebase at once.
A 1M window (Claude 4 Opus extended, Llama 4 large) holds roughly: 50,000-75,000 lines of code, plus documentation, test files, and a full agent conversation history. A significant microservice or a complete medium-sized application.
A 2M window (Gemini 2.5 Pro) holds roughly: an entire large codebase, something in the range of 100,000-150,000 lines, or multiple codebases simultaneously. This is genuinely new capability from a year ago.
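The line counts above all follow from one assumption: a line of code costs somewhere between about 13 and 20 tokens. The sketch below makes that explicit so you can plug in your own density. Both constants are back-derived from the 200K estimate in this section, not measured, and the function name is illustrative.

```python
# Tokens per line of code implied by the estimates above (200K tokens ~ 10,000-15,000 lines).
# These are rough assumptions; real density depends on language and coding style.
TOKENS_PER_LINE_DENSE = 20        # long lines, heavy comments
TOKENS_PER_LINE_SPARSE = 40 / 3   # about 13.3 tokens per line


def lines_of_code_range(window_tokens: int) -> tuple[int, int]:
    """Return (low, high) estimates of code lines if code filled the whole window."""
    low = round(window_tokens / TOKENS_PER_LINE_DENSE)
    high = round(window_tokens / TOKENS_PER_LINE_SPARSE)
    return low, high


for window in (32_000, 200_000, 1_000_000, 2_000_000):
    low, high = lines_of_code_range(window)
    print(f"{window:>9,} tokens ~ {low:,} to {high:,} lines")
```

These figures assume the window holds nothing but code. In a real session, test output, conversation history, and system instructions take their share, which is why the 32K estimate in the text is much lower than this calculation gives.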
How agents manage context
Even with larger windows, context management is not a solved problem. The best tools have specific strategies.
Summarization and compaction
When a context window starts to fill up, some agents summarize earlier portions of the conversation and replace the raw history with a compressed version. Claude Code calls this "auto-compact": when the context reaches a threshold, it produces a concise summary of what has happened so far, covering what the task is, what has been tried, what decisions were made, and what the current state is. That summary replaces the full history, freeing up space for new work.
Done well, this works reasonably. The model loses some detail about earlier steps but retains the decisions and direction. Done poorly, important details get dropped and the agent starts making mistakes based on incomplete memory.
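The general shape of compaction can be sketched in a few lines. This is not Claude Code's implementation, which I have not seen. It is a minimal illustration of the idea, and the 80% threshold, the `keep_recent` parameter, and the `fake_summarize` stand-in are all assumptions.

```python
from typing import Callable

Message = tuple[str, int]  # (text, token count)


def compact(history: list[Message], limit: int,
            summarize: Callable[[list[Message]], Message],
            threshold: float = 0.8, keep_recent: int = 2) -> list[Message]:
    """Replace older messages with one summary once usage passes threshold * limit."""
    used = sum(tokens for _, tokens in history)
    if used <= threshold * limit or len(history) <= keep_recent:
        return history  # still room, or nothing old enough to summarize
    split = len(history) - keep_recent
    older, recent = history[:split], history[split:]
    return [summarize(older)] + recent


def fake_summarize(messages: list[Message]) -> Message:
    """Stand-in for a model call; a real agent would ask the model to write the summary."""
    return (f"SUMMARY of {len(messages)} earlier messages", 200)


history = [("task description", 500), ("file dump", 60_000),
           ("test output", 30_000), ("latest edit", 4_000)]
compacted = compact(history, limit=100_000, summarize=fake_summarize)
print(compacted)
print(sum(tokens for _, tokens in compacted), "tokens after compaction")
```

Everything interesting happens inside the summarizer. If it fails to carry forward "approach X already failed", the agent will happily try approach X again.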
Retrieval augmented generation (RAG)
Rather than loading everything into context at once, some systems pull in relevant information on demand. The agent has access to a vector store or a search index of the codebase, documentation, or previous session notes. When it needs information, it retrieves the relevant chunks rather than having them in context the whole time.
This works well for read-heavy tasks where the agent needs to consult many reference sources but does not need all of them simultaneously. It works less well for tasks that require holding multiple pieces of related information in mind at once, because the retrieval step can miss context that was relevant but not explicitly searched for.
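Here is a toy version of on-demand retrieval. To keep it dependency-free, it ranks chunks by plain keyword overlap rather than the vector similarity a real system would use, so treat it as a sketch of the control flow only. The function names and sample chunks are hypothetical.

```python
def score(query: str, chunk: str) -> int:
    """Count distinct query words that also appear in the chunk (a crude relevance proxy)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k chunks with a nonzero score, best first."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return [c for c in ranked[:top_k] if score(query, c) > 0]


chunks = [
    "def login(user): validate credentials and start a session",
    "def render_chart(data): draw a bar chart",
    "session timeout is configured in settings.py as SESSION_TTL",
]
result = retrieve("why does the session timeout fire early", chunks)
print(result)
```

Only the chunks that share words with the query enter the context. The weakness described above shows up here too: a chunk that explains the bug without using the words "session" or "timeout" would never be retrieved.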
Hierarchical memory
Some more sophisticated agent architectures maintain multiple levels of memory: a full detail log, a running summary, and a compressed long-term store. The agent actively manages what moves between levels, similar to how humans hold recent events in working memory and compress older memories into longer-term storage.
This approach is more complex to build and debug, but it is increasingly common in production agents that need to work over long time horizons. Letta (formerly MemGPT) is a framework specifically designed around this kind of memory architecture.
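A stripped-down version of the three-level idea looks like this. It is not Letta's API or design, just a minimal sketch of items moving between tiers. The class name, the tier sizes, and the 40-character "compression" are placeholders for what would be model-driven decisions in a real agent.

```python
from dataclasses import dataclass, field


@dataclass
class TieredMemory:
    """Three levels: recent detail, a running summary, and a long-term archive."""
    max_recent: int = 3
    max_summary: int = 5
    recent: list[str] = field(default_factory=list)    # full-detail log
    summary: list[str] = field(default_factory=list)   # compressed notes
    archive: list[str] = field(default_factory=list)   # long-term store, kept out of context

    def record(self, event: str) -> None:
        """Add an event, demoting the oldest items one level when a tier is full."""
        self.recent.append(event)
        if len(self.recent) > self.max_recent:
            oldest = self.recent.pop(0)
            # Stand-in for model-written compression: keep only the first 40 characters.
            self.summary.append(oldest[:40])
        if len(self.summary) > self.max_summary:
            self.archive.append(self.summary.pop(0))

    def context(self) -> list[str]:
        """What would be placed in the model's window: summary first, then recent detail."""
        return self.summary + self.recent


mem = TieredMemory(max_recent=2, max_summary=1)
for i in range(5):
    mem.record(f"step {i}")

print(mem.context())
print(mem.archive)
```

The archive never enters the window on its own. A real system would pair it with a retrieval step so that old items can be pulled back in when they become relevant.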
Explicit context control
Some tools let the developer or user explicitly control what goes into context. In Cursor, you can pin specific files to the context or explicitly include or exclude files from what the AI sees. In Aider, you control exactly which files are in the "chat context" at any time.
This manual approach puts more burden on the user but gives more predictable results. Experienced developers who know their codebase well can often manage context more effectively than an automated system by hand-selecting what the agent needs to see.
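Manual control amounts to maintaining an explicit set of files and building the prompt from only that set. The sketch below shows the idea in isolation. It does not reflect how Cursor or Aider work internally, and the `ChatContext` class and file contents are hypothetical.

```python
class ChatContext:
    """A hand-managed set of files the agent is allowed to see."""

    def __init__(self) -> None:
        self._files: dict[str, str] = {}

    def add(self, path: str, content: str) -> None:
        """Put a file into context, replacing any earlier copy."""
        self._files[path] = content

    def drop(self, path: str) -> None:
        """Remove a file from context; dropping an absent file is a no-op."""
        self._files.pop(path, None)

    def render(self) -> str:
        """Build the text that would be sent to the model, in a stable order."""
        return "\n\n".join(f"# {p}\n{self._files[p]}" for p in sorted(self._files))


ctx = ChatContext()
ctx.add("auth.py", "def login(user): ...")
ctx.add("charts.py", "def render_chart(data): ...")
ctx.drop("charts.py")  # not relevant to the login bug, so keep it out of the window
print(ctx.render())
```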
What to watch out for in practice
The forgetting problem: When an agent resets or compacts context, it may not remember that a particular approach already failed. If you are watching a coding agent make the same mistake twice, context loss is a likely cause.
Middle-of-context blindness: Research has shown that models are better at attending to information at the beginning and end of their context window than in the middle. For agents working with very long contexts, this means information buried in the middle of the session history may be effectively invisible. This is one reason the 2M window on Gemini is not automatically 10x better than a 200K window: the extra capacity only helps if the model reliably attends to all of it.
Context != memory: A larger context window is not the same thing as persistent memory across sessions. When you start a new conversation with Claude Code or Cursor, the previous session's context is gone. Longer context windows make individual sessions more powerful, but they do not provide cross-session continuity. For that, you need explicit memory systems.
Cost scales with context: API pricing for most models is per token, for both input and output. A 1 million token context is not free. An agent session that reads a large codebase into context and runs for an hour can generate significant API costs. For cloud vs. self-hosted decisions, this is one of the factors that makes local model hosting economically interesting for heavy agent use.
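A back-of-the-envelope estimate shows why this adds up. The sketch assumes each turn resends its full context as input and ignores discounts such as prompt caching. The prices are placeholders, not any provider's actual rates, so substitute current numbers from the pricing page of the model you use.

```python
def session_cost(turn_context_tokens: list[int], output_tokens_per_turn: int,
                 input_price_per_million: float, output_price_per_million: float) -> float:
    """Estimate total cost when each turn resends its full context as input."""
    input_total = sum(turn_context_tokens)
    output_total = output_tokens_per_turn * len(turn_context_tokens)
    return (input_total * input_price_per_million
            + output_total * output_price_per_million) / 1_000_000


# Hypothetical session: context grows from 100K to 250K tokens over four turns.
contexts = [100_000, 150_000, 200_000, 250_000]
cost = session_cost(contexts, output_tokens_per_turn=1_000,
                    input_price_per_million=3.0,     # placeholder price in dollars
                    output_price_per_million=15.0)   # placeholder price in dollars
print(f"${cost:.2f} for {len(contexts)} turns")
```

In this example almost all of the cost comes from input tokens, because the same large context is paid for again on every turn. Agent sessions run for dozens or hundreds of turns, so the total scales accordingly.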
The bottom line
Context windows in 2026 are large enough to handle most real-world agentic tasks that were impractical two years ago. Gemini's 2M window, Claude 4 Opus's extended context, and GPT-5's 400K limit have meaningfully changed what agents can do in a single session.
But context limits still matter, and they matter most in the exact scenarios where agents are most useful: long-running tasks, large codebases, multi-step research that accumulates a lot of intermediate output. Understanding how a tool handles context limits, whether it compacts, retrieves, or just quietly drops history, is one of the most important things to check before trusting an agent with a complex task.
If you are choosing between agent tools and context management is a concern, look at the tool's compaction behavior and test it with a task long enough to actually trigger it. The behavior when context runs out tells you a lot about how the system was designed.