Anovate.ai
Tags: ai-agents, benchmarks, coding-agents, llm-evaluation, software-engineering, developer-tools

Beyond SWE-Bench: How to Evaluate AI Coding Agents for Real Codebases

Anovate.ai | Jun 22, 2026 | 7 min read

Every few weeks, a new AI model ships with a headline benchmark number. On some public coding benchmarks, top models cluster within a narrow band, making it hard to tell which agent will hold up in a real codebase. On May 26, 2026, Datacurve released DeepSWE, a coding-agent evaluation that surfaces structural problems with existing benchmarks and tries to address them. It is a more rigorous filter, but not the final answer.

Public benchmark scores are only the first of five filters that determine production performance. The other four are the execution harness (the software layer that controls how an agent receives tasks, calls tools, and submits results, also called a scaffold), testing on your own codebase, your review process, and your production constraints. Each one shapes real-world outcomes, and none of them appears on any leaderboard.

Figure 1. Top models cluster within a narrow band on SWE-Bench Pro but show a much wider spread on DeepSWE. Source: Datacurve

Why Coding Benchmarks Gave a False Sense of Parity

SWE-Bench Pro has been one of the most visible public benchmarks for evaluating coding agents. Several structural issues made the scores less discriminative than they appeared.

Contamination. Where tasks were drawn from public GitHub history, problem statements and solution patterns may have appeared in training data, so rankings partly reflected memorization rather than reasoning.

Grading errors. Datacurve's audit found SWE-Bench Pro's verifiers misgrade outputs at significant rates: 8.5% false positives and 24% false negatives, often because inherited test suites were not designed to grade arbitrary solutions (Datacurve). DeepSWE's hand-written verifiers registered 0.3% and 1.1% respectively.
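To make the two error rates concrete, here is a minimal Python sketch of how they could be computed from a hand-audited sample of graded solutions. The `AuditedGrade` record and the choice of denominators (false positives over incorrect solutions, false negatives over correct ones) are assumptions for illustration; Datacurve's exact audit procedure is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditedGrade:
    """One graded solution, re-checked by a human (hypothetical record)."""
    verifier_passed: bool   # what the automated verifier reported
    actually_correct: bool  # what the human audit concluded


def verifier_error_rates(grades: list[AuditedGrade]) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate).

    Assumed definitions:
      false positive = verifier passed a solution the audit judged incorrect
      false negative = verifier failed a solution the audit judged correct
    """
    incorrect = [g for g in grades if not g.actually_correct]
    correct = [g for g in grades if g.actually_correct]

    false_positives = sum(g.verifier_passed for g in incorrect)
    false_negatives = sum(not g.verifier_passed for g in correct)

    fpr = false_positives / len(incorrect) if incorrect else 0.0
    fnr = false_negatives / len(correct) if correct else 0.0
    return fpr, fnr
```

Running the same audit on your own verifiers, even on a few dozen samples, gives a rough sense of how much of a reported pass rate is grading noise.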

Reference solution exposure. SWE-Bench Pro's Docker containers shipped the full git history, including the merged fix. Some agent configurations retrieved the reference commit using git log --all, with no mechanism to catch this.
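A team building its own evaluation can check for this kind of exposure before an agent ever sees the repository. The sketch below assumes git 2.15 or later is on the PATH (for `git rev-parse --is-shallow-repository`). The `history_exposure` helper and its `needs_review` rule are hypothetical and are not part of SWE-Bench Pro or DeepSWE.

```python
import subprocess
from typing import Callable


def run_git(repo_dir: str, *args: str) -> str:
    """Run a git command inside repo_dir and return its trimmed stdout."""
    result = subprocess.run(
        ["git", "-C", repo_dir, *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def history_exposure(repo_dir: str, git: Callable[..., str] = run_git) -> dict:
    """Summarize how much commit history is readable inside a task repo.

    The `git` parameter exists so the check can be exercised without a
    real repository; by default it shells out to the git CLI.
    """
    is_shallow = git(repo_dir, "rev-parse", "--is-shallow-repository") == "true"
    reachable = int(git(repo_dir, "rev-list", "--all", "--count"))
    return {
        "is_shallow": is_shallow,
        "reachable_commits": reachable,
        # Assumed rule of thumb: anything beyond a single-commit shallow
        # clone might contain the merged fix and deserves a manual look.
        "needs_review": (not is_shallow) or reachable > 1,
    }
```

A flagged repository is not proof of leakage; it only means the history an agent could reach with `git log --all` should be inspected before results are trusted.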

SWE-Bench Pro vs DeepSWE: Structural Differences

| Dimension | SWE-Bench Pro | DeepSWE |
| --- | --- | --- |
| Task source | Public, held-out, and commercial repositories | Written from scratch, never merged publicly |
| Avg. solution size | ~120 lines across 5 files | ~668 lines across 7 files |
| Repository count | 41 repos (11 public, 12 held-out, 18 commercial) | 91 repos, 5 languages |
| Verifier false positive rate | 8.5% | 0.3% |
| Verifier false negative rate | 24% | 1.1% |
| Reference solution accessible to agent | Yes, via full git history | No, shallow clone only |

What DeepSWE Changes

DeepSWE contains 113 original tasks across 91 open-source repositories in TypeScript, Go, Python, JavaScript, and Rust (Datacurve). Agents must explore the codebase, form a plan, and implement changes across multiple files, which is closer to what software engineering actually looks like.

One finding worth flagging: on DeepSWE, most models wrote and ran their own tests even without being asked. On SWE-Bench Pro, the same models did this far less often, because the prompt template instructed agents not to modify tests (VentureBeat). A single prompt instruction suppressed a behavior that likely improved performance. Your own prompt design can quietly limit what your coding agent does, independent of which model you chose.
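One lightweight way to act on this is to scan your own prompt templates for instructions that restrict agent behavior. The following is a minimal sketch: the pattern list is hypothetical and deliberately small, and a regex match only marks a line for human review rather than proving that the instruction hurts performance.

```python
import re

# Hypothetical patterns; extend them with phrasing from your own templates.
SUPPRESSING_PATTERNS = {
    "test modification": re.compile(
        r"\b(do not|don't|never)\s+(modify|edit|change|add|write)\b[^.\n]*\btests?\b",
        re.IGNORECASE,
    ),
    "command execution": re.compile(
        r"\b(do not|don't|never)\s+(run|execute)\b",
        re.IGNORECASE,
    ),
}


def audit_prompt(template: str) -> list[tuple[str, str]]:
    """Return (label, line) pairs for lines that may suppress agent behavior."""
    findings = []
    for line in template.splitlines():
        for label, pattern in SUPPRESSING_PATTERNS.items():
            if pattern.search(line):
                findings.append((label, line.strip()))
    return findings
```

Each finding is a prompt line worth A/B testing: run the same tasks with and without the instruction and compare pass rates.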

The Results: The Spread Matters More Than the Winner

When you reduce contamination, tighten the grading, and remove access to the reference solution, scores separate. The spread between top and bottom models on DeepSWE is roughly 65 percentage points, far wider than the same models showed on SWE-Bench Pro (Datacurve).

DeepSWE Leaderboard

| Model | Pass Rate | Avg Cost | Avg Time | Output Tokens |
| --- | --- | --- | --- | --- |
| GPT-5.5 [xhigh] | 70% ± 3% | $6.61 | 21 min | 47k |
| GPT-5.4 [xhigh] | 56% ± 2% | $4.38 | 27 min | 71k |
| Claude Opus 4.7 [max] | 54% ± 5% | $18.19 | 39 min | 103k |
| Claude Sonnet 4.6 [high] | 32% ± 2% | $5.52 | 42 min | 76k |
| GPT-5.4-mini [xhigh] | 24% ± 3% | $2.08 | 33 min | 135k |
| Gemini 3.1 Pro | 10% ± 3% | $1.84 | 36 min | 53k |
| Gemini 3 Flash | 5% ± 2% | $1.53 | 39 min | 233k |

All models were run on mini-swe-agent for consistency. All figures are from the Datacurve DeepSWE leaderboard.

Cost and time matter as much as pass rate. Claude Opus 4.7 costs nearly three times as much per run as GPT-5.5 ($18.19 vs. $6.61) and takes almost twice as long (39 min vs. 21 min), for a lower score. Gemini 3 Flash produces nearly five times GPT-5.5's output tokens (233k vs. 47k) at a far lower pass rate. Even the top configuration fails roughly 30% of tasks on real open-source repositories. On your codebase, that number will be different, and you will not know what it is until you measure it.
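One way to fold cost and pass rate into a single number is the expected spend per solved task: average cost per run divided by pass rate. This is a derived metric, not one Datacurve reports, and it assumes a failed run costs as much as a successful one. The figures below are copied from the leaderboard above.

```python
# (pass rate, average cost in USD per run), copied from the DeepSWE leaderboard.
LEADERBOARD = {
    "GPT-5.5 [xhigh]": (0.70, 6.61),
    "GPT-5.4 [xhigh]": (0.56, 4.38),
    "Claude Opus 4.7 [max]": (0.54, 18.19),
    "Claude Sonnet 4.6 [high]": (0.32, 5.52),
    "GPT-5.4-mini [xhigh]": (0.24, 2.08),
    "Gemini 3.1 Pro": (0.10, 1.84),
    "Gemini 3 Flash": (0.05, 1.53),
}


def cost_per_solved_task(pass_rate: float, avg_cost: float) -> float:
    """Expected spend per solved task, assuming every run costs avg_cost."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return avg_cost / pass_rate


# Cheapest per solved task first.
ranked = sorted(
    (
        (model, cost_per_solved_task(rate, cost))
        for model, (rate, cost) in LEADERBOARD.items()
    ),
    key=lambda item: item[1],
)

for model, dollars in ranked:
    print(f"{model:<26} ${dollars:6.2f} per solved task")
```

Under this assumption the ordering shifts: GPT-5.4 comes out cheapest per solved task at roughly $7.82, ahead of GPT-5.5 at roughly $9.44, while Claude Opus 4.7 is the most expensive at roughly $33.69. The cheapest models per run are not the cheapest per solved task.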

The Harness Problem: What Leaderboards Don't Show

Every model in DeepSWE was run through the same scaffold, mini-swe-agent, which keeps comparisons controlled. The tradeoff is that a single-scaffold leaderboard cannot show how much harness choice changes performance in practice. A 2026 position paper documented up to 15 percentage points of scaffold-only variation on SWE-bench Verified for the same model (Zhang et al.). Head-to-head comparisons have shown even larger gaps: GPT-5.5 went from 61.5% to 87.2%, a 25.7-point jump, in a single harness swap (MindStudio). In some evaluations, the spread between harnesses exceeds the spread between models.

Model-harness pairing matters too: GPT-5.5 performs best in OpenAI's native Codex environment, and Claude Opus performs best inside Claude Code.

What Actually Moves Coding Agent Performance

| Factor | Typical Performance Swing |
| --- | --- |
| Switching frontier models, same harness | 2 to 15 percentage points |
| Scaffold-only variation, same model | 15 to 25+ percentage points |
| Harness swap (GPT-5.5: Codex to Cursor) | 25.7 percentage points |
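If you collect your own scores across several models and harnesses, the comparison in this table reduces to two spreads: the gap across harnesses for a fixed model, and the gap across models for a fixed harness. The sketch below is illustrative; the grid layout and function names are assumptions, and the only real data point used is the 61.5% and 87.2% pair cited above.

```python
def spread(scores: dict[str, float]) -> float:
    """Gap in percentage points between the best and worst score."""
    if not scores:
        raise ValueError("scores must not be empty")
    return round(max(scores.values()) - min(scores.values()), 1)


def compare_spreads(grid: dict[str, dict[str, float]]) -> tuple[float, float]:
    """grid maps model -> {harness: pass rate in percent}.

    Returns (harness_effect, model_effect):
      harness_effect = largest gap across harnesses for any single model
      model_effect   = largest gap across models for any single harness
    """
    # Invert the grid so scores can also be grouped by harness.
    by_harness: dict[str, dict[str, float]] = {}
    for model, runs in grid.items():
        for harness, score in runs.items():
            by_harness.setdefault(harness, {})[model] = score

    harness_effect = max(spread(runs) for runs in grid.values())
    model_effect = max(spread(models) for models in by_harness.values())
    return harness_effect, model_effect


# The single GPT-5.5 harness swap reported by MindStudio.
gpt55_swap = {"before swap": 61.5, "after swap": 87.2}
print(spread(gpt55_swap))
```

When `harness_effect` is larger than `model_effect` on your own grid, tuning the harness is likely to pay off more than switching models.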

How to Evaluate Coding Agents for Your Own Stack

Public benchmarks are a filter, not a decision. The real evaluation happens on your code, with your harness, measured against outcomes your team actually cares about.

A Practical Evaluation Framework

| Step | What to do | Why it matters |
| --- | --- | --- |
| 1. Filter with public benchmarks | Use DeepSWE to eliminate the bottom of the field | Narrows candidates without internal evaluation cost |
| 2. Match the harness to your stack | Test surviving models on the tools your team uses | Harness can swing scores by 15 to 25+ points |
| 3. Run against your own codebase | Use DeepSWE's open-source Pier harness on your own repos | Public tasks don't reflect your architecture or conventions |
| 4. Measure production metrics | Track time to PR, acceptance rate, rework rate, cost per task | These are the numbers that matter in practice |
| 5. Audit your prompt design | Check whether templates suppress useful agent behaviors | A single instruction can significantly cut self-testing rates |

Start with bounded tasks: bug fixes, test generation, dependency upgrades, documentation updates, and small refactors. Avoid giving agents ownership over architecture, security-sensitive code, auth, or payments before evaluation is complete. DeepSWE's Pier harness is open-source, so teams can write their own tasks, define their own verifiers, and run evaluations against private codebases.
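Step 4 of the framework can be tracked with very little tooling. Here is a minimal sketch, assuming each agent task is logged as a simple record; the `TaskRun` fields and the metric definitions (for example, rework rate as the share of accepted PRs that needed follow-up changes) are illustrative assumptions, not part of DeepSWE or Pier.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class TaskRun:
    """One agent attempt on an internal task (hypothetical log record)."""
    minutes_to_pr: float  # wall-clock time from task start to opened PR
    accepted: bool        # PR was merged after review
    reworked: bool        # merged PR needed follow-up changes
    cost_usd: float       # model and infrastructure spend for the attempt


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    """Aggregate the production metrics named in the framework table."""
    if not runs:
        raise ValueError("need at least one run")

    accepted = [r for r in runs if r.accepted]
    total_cost = sum(r.cost_usd for r in runs)

    return {
        "median_minutes_to_pr": median(r.minutes_to_pr for r in runs),
        "acceptance_rate": len(accepted) / len(runs),
        # Rework is measured only over PRs that were actually merged.
        "rework_rate": (
            sum(r.reworked for r in accepted) / len(accepted) if accepted else 0.0
        ),
        "cost_per_task": total_cost / len(runs),
        # Failed attempts still cost money, so they are charged to the successes.
        "cost_per_accepted_task": (
            total_cost / len(accepted) if accepted else float("inf")
        ),
    }
```

Computing the same summary per model and harness pairing makes the candidates directly comparable on the outcomes your team cares about.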

Figure 2. The five-stage evaluation flow for selecting the right coding agent for your stack.

The Buying Decision Should Come After Your Own Evaluation

Use public benchmarks to filter, harness-aware comparisons to narrow further, then run the remaining candidates against your own codebase. Commit to a purchase only after you have evaluated the full combination of model, harness, and workflow.


References

  1. Zhang, Y., et al. Stop Comparing LLM Agents Without Disclosing the Harness. arXiv, 2026. https://arxiv.org/html/2605.23950v1

  2. MindStudio. Agent Harnesses Beat Model Upgrades: 5 Benchmarks That Prove the Harness Is Now the Product. MindStudio Blog, 2026. https://www.mindstudio.ai/blog/agent-harnesses-beat-model-upgrades-5-benchmarks

  3. Schmid, P. The Importance of Agent Harness in 2026. Phil Schmid Blog, 2026. https://www.philschmid.de/agent-harness-2026

  4. Datacurve. DeepSWE Official Leaderboard and Methodology. Datacurve AI, 2026. https://deepswe.datacurve.ai

  5. Datacurve. DeepSWE GitHub Repository: Benchmark Tasks and Pier Harness. GitHub, 2026. https://github.com/datacurve-ai/deep-swe

  6. VentureBeat Staff. DeepSWE Blows Up the AI Coding Leaderboard, Crowns GPT-5.5. VentureBeat, 2026. https://venturebeat.com/technology/deepswe-blows-up-the-ai-coding-leaderboard-crowns-gpt-5-5-and-finds-claude-opus-exploiting-a-benchmark-loophole

  7. SWE-agent Team. mini-swe-agent. GitHub, 2026. https://github.com/SWE-agent/mini-swe-agent

  8. BenchLM.ai. DeepSWE Benchmark 2026. BenchLM, 2026. https://benchlm.ai/benchmarks/deepSwe

  9. Marc0. SWE-Bench Leaderboard. Marc0 Dev, 2026. https://www.marc0.dev/en/leaderboard

  10. BSWEN. What Does SWE-bench Pro Reveal About Agent Scaffold Performance? BSWEN Docs, 2026. https://docs.bswen.com/blog/2026-04-20-swe-bench-pro-agent-scaffold/