AI Coding Agent Benchmarks 2026: SWE-bench Scores and What They Miss
SWE-bench Verified has become the standard leaderboard for coding agents, and the scores matter: they're real measurements of model capability on real engineering tasks. But they also get misread. A tool that scores 72% on SWE-bench doesn't mean it can complete 72% of your engineering work autonomously. Understanding what the benchmark actually measures, and what it deliberately ignores, is what makes the scores interpretable.
What SWE-bench Verified measures
SWE-bench was created at Princeton in 2023. The original dataset contains 2,294 real GitHub issues from popular Python repositories: Django, Flask, numpy, scikit-learn, and similar projects. Each issue comes with a description and a test suite that verifies the fix.
The task: given the repository code and the issue description, produce a code change that makes the failing tests pass.
SWE-bench Verified is a human-curated subset of 500 tasks from the full dataset. The "Verified" designation means each task was validated by human experts to confirm that the issue is actually solvable and that the test suite accurately reflects the fix. The original SWE-bench had some ambiguous or incorrectly specified tasks; Verified removed those.
What makes this benchmark valuable: it's not multiple-choice questions or toy puzzles. These are real repository issues that real engineers filed, with real codebases, and real tests. Solving them requires understanding the codebase structure, identifying the root cause, writing a fix that doesn't break other tests, and formatting the change as a valid patch.
What makes it limited: it only measures bug fixing and small feature additions in existing Python codebases. It says nothing about writing new code from scratch, handling TypeScript or Go or Rust, architectural decisions, code review, or interacting with a developer over multiple sessions.
The numbers: SWE-bench Verified scores, May 2026
These are the published or estimated scores as of May 2026. Benchmark scores change with model updates, scaffold improvements, and evaluation methodology changes, so treat these as the current state rather than permanent rankings.
Claude Code (Anthropic), using Claude 4 Opus: 72.1% This is Anthropic's own agent scaffold optimized for Claude 4 Opus. Claude Code runs in your terminal with file system access and code execution. The high score reflects both the model quality and a scaffold designed specifically for this benchmark task type.
OpenAI Codex (GPT-5 based, scaffolded): 68.3% OpenAI's coding agent built on GPT-5. Slightly below Claude Code on this benchmark; the gap is within the range that could reverse with the next model or scaffold update.
Devin 2.0 (Cognition AI): ~65% Devin's score has improved substantially since the 2024 debut. The full development environment approach (browser, terminal, editor as separate tools) produces good results on tasks that require multi-file edits. Slightly lower than the frontier models on SWE-bench specifically.
Aider with Claude 4 Opus: ~68% Aider is an open-source coding assistant that can be pointed at different models. With Claude 4 Opus as the backend and the "architect" mode that separates planning from execution, Aider achieves scores competitive with proprietary agents. The open-source nature means you can inspect and modify the scaffold.
Cursor (agent mode, Claude 4 Opus backend): ~58% Cursor's agent mode scores noticeably lower than the specialized coding agent tools on SWE-bench. This doesn't mean Cursor is worse to use day-to-day; it reflects that Cursor is optimized for interactive development rather than autonomous bug fixing. The benchmark rewards autonomous completion; Cursor's design includes more human-in-the-loop interaction.
GitHub Copilot Workspace: ~44% Copilot Workspace is newer and at a lower autonomy level. It proposes plans and requires developer approval at each stage. The lower SWE-bench score reflects the scaffolding approach more than underlying model capability.
Smaller/cheaper tiers (Claude 4 Sonnet, GPT-4o based agents): typically 45-60% The 10-20 percentage point gap between Sonnet-class and Opus-class models on SWE-bench is real and significant for complex bug-fixing tasks. For simpler coding assistance (autocomplete, function generation, code explanation), the gap is much smaller.
What the scores actually predict
Higher SWE-bench scores correlate with better performance on:
- Multi-file bug fixes in existing codebases.
- Tasks with a clear right answer (tests either pass or don't).
- Python-heavy environments (most of SWE-bench is Python).
They don't predict performance on:
- Writing new applications from scratch.
- Languages other than Python (though cross-language performance correlates with overall model quality).
- Architectural and design decisions where "correct" isn't binary.
- Interactive, iterative development sessions with a developer.
- Tasks requiring visual or UI understanding.
- Long-running tasks spanning multiple days and sessions.
One thing I've noticed: agents that score well on SWE-bench have often optimized their scaffolds specifically for the benchmark format (clear issue description, existing tests, Python repo). When you take the same agent and give it a task where the requirements are vague, there's no existing test suite, or the codebase is messy, the performance degrades more steeply for some tools than others.
The leaderboard isn't the whole story
Three factors that matter for real-world utility but aren't captured by SWE-bench:
Speed and latency. A tool that scores 72% but takes 8 minutes per task may be less useful in practice than one scoring 62% that completes tasks in 90 seconds. Benchmark scores don't include timing. Claude Code on complex SWE-bench tasks can take 3-15 minutes. Devin's more complex architecture can take longer. For interactive use, the difference between a 2-minute and 10-minute completion feels enormous.
Hallucination and false positives. SWE-bench measures whether the tests pass, not whether the code is correct in a broader sense. An agent can pass the tests by adding a special case for the exact test input rather than fixing the underlying issue. Some agents are better than others at producing principled fixes rather than test-passing patches. This matters for code quality in production even when SWE-bench gives a passing grade.
Context handling and multi-file coordination. Many SWE-bench tasks involve changes to 1-3 files. Real-world codebases often require changes coordinated across many more files, with understanding of how they interact. Tools that score similarly on SWE-bench can differ significantly on tasks requiring broader codebase understanding.
User experience and workflow integration. This sounds soft, but it's the reason Cursor has millions of users despite a lower SWE-bench score than Claude Code. A tool that fits naturally into how developers actually work, with fast feedback, good context awareness, and IDE integration, gets used. A tool with a better autonomous benchmark score that's awkward to use in a real workflow doesn't.
How to use these benchmarks when choosing a tool
My recommendation: use SWE-bench as a floor, not a ceiling.
If a tool scores below 40% on SWE-bench, its underlying model or scaffold is weak enough that you'll feel it in everyday use. Below a certain threshold, the benchmark is a meaningful signal.
Above 55-60%, the benchmark differences are smaller than the workflow and ergonomics differences. At that point, test the tools on a sample of your actual tasks. Pick a few representative tasks from your real work, not benchmark tasks, and run them on each tool you're considering. The scores on your specific use case are the relevant numbers.
Also worth noting: these scores are a snapshot. The gap between Devin 2.0 at 65% and Claude Code at 72% is meaningful today, but both will improve. Any comparison from a few months ago is already somewhat stale. If you're making a long-term platform decision, evaluate the trajectory and roadmap as much as the current score.
Where to find current scores
The official SWE-bench leaderboard at swebench.com is the canonical source. Check dates on any article citing scores; the leaderboard changes frequently. Anthropic, Cognition, and OpenAI also publish their own evaluation results, which should match the leaderboard but may include internal evaluations on updated models not yet submitted to the public leaderboard.
For Aider specifically, the project maintains detailed benchmark results at aider.chat/docs/leaderboards including breakdowns by model and scaffold variant. This is one of the most transparent self-reporting efforts in the space.