Beyond SWE-Bench: How to Evaluate AI Coding Agents for Real Codebases

Every few weeks, a new AI model ships with a headline benchmark number. On some public coding benchmarks, top models cluster within a narrow band, making it hard to tell which agent will hold up in a real codebase. On May 26, 2026, Datacurve released DeepSWE, a coding-agent evaluation that surfaces structural problems with existing benchmarks and tries to address them. It is a more rigorous filter, but not the final answer.

Public benchmark scores are only the first of five filters that determine production performance. The other four are the execution harness, own-codebase testing, review process, and production constraints. The harness, also called a scaffold, is the software layer that controls how an agent receives tasks, calls tools, and submits results. Each of these shapes real-world outcomes, and none of them appears on any leaderboard.

DeepSWE Score Spread vs SWE-Bench Clustering

Figure 1. Top models cluster within a narrow band on SWE-Bench Pro but show a much wider spread on DeepSWE. Source: Datacurve

Why Coding Benchmarks Gave a False Sense of Parity

SWE-Bench Pro has been one of the most visible public benchmarks for evaluating coding agents. Several structural issues made the scores less discriminative than they appeared.

Contamination. Because tasks were drawn from public GitHub history, problem statements and solution patterns may have appeared in training data, so rankings partly reflected memorization rather than reasoning.

Grading errors. Datacurve's audit found SWE-Bench Pro's verifiers misgrade outputs at significant rates: 8.5% false positives and 24% false negatives, often because inherited test suites were not designed to grade arbitrary solutions (Datacurve). DeepSWE's hand-written verifiers registered 0.3% and 1.1% respectively.

Reference solution exposure. SWE-Bench Pro's Docker containers shipped the full git history, including the merged fix. Some agent configurations retrieved the reference commit using git log --all, with no mechanism to catch this.
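If you build your own task environments, it is worth checking how much history a checkout exposes before an agent ever sees it. The following is a minimal sketch, not Datacurve's tooling: it shells out to two standard git commands, and the function names and the "more than one reachable commit needs an audit" policy are assumptions made for illustration.

```python
import subprocess
from typing import Callable, Sequence


def _git(repo: str, args: Sequence[str]) -> str:
    """Run a git command inside `repo` and return its stripped stdout."""
    result = subprocess.run(
        ["git", "-C", repo, *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def history_exposure(
    repo: str, git: Callable[[str, Sequence[str]], str] = _git
) -> dict:
    """Summarize how much commit history is readable from this checkout."""
    # Prints "true" for a shallow clone and "false" otherwise.
    is_shallow = git(repo, ["rev-parse", "--is-shallow-repository"]) == "true"
    # Counts every commit reachable from any ref, which is what
    # `git log --all` would let an agent browse.
    reachable = int(git(repo, ["rev-list", "--count", "--all"]))
    return {
        "is_shallow": is_shallow,
        "reachable_commits": reachable,
        # Hypothetical policy: anything beyond a single commit deserves a look.
        "needs_audit": reachable > 1,
    }
```

This only measures reachable history. It does not tell you whether the checked-out commit itself already contains the fix, so the task's base commit still needs to be verified separately.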

SWE-Bench Pro vs DeepSWE: Structural Differences

Dimension	SWE-Bench Pro	DeepSWE
Task source	Public, held-out, and commercial repositories	Written from scratch, never merged publicly
Avg. solution size	~120 lines across 5 files	~668 lines across 7 files
Repository count	41 repos (11 public, 12 held-out, 18 commercial)	91 repos, 5 languages
Verifier false positive rate	8.5%	0.3%
Verifier false negative rate	24%	1.1%
Reference solution accessible to agent	Yes, via full git history	No, shallow clone only
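To make the verifier error rates in the table concrete, here is a rough sketch of how you might compute the same two numbers when auditing your own verifiers. It assumes a human reviewer has labeled each agent output as correct or not, and that each rate is taken over the matching ground-truth class. Datacurve's exact denominators are not stated here, so treat the definitions and the AuditedRun type as assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AuditedRun:
    verifier_passed: bool   # what the automated verifier reported
    actually_correct: bool  # what a human auditor decided (assumed ground truth)


def verifier_error_rates(runs: Iterable[AuditedRun]) -> dict:
    """Return false positive and false negative rates for a verifier."""
    runs = list(runs)
    incorrect = [r for r in runs if not r.actually_correct]
    correct = [r for r in runs if r.actually_correct]
    # False positive: the verifier passed a solution the auditor rejected.
    false_positives = sum(r.verifier_passed for r in incorrect)
    # False negative: the verifier failed a solution the auditor accepted.
    false_negatives = sum(not r.verifier_passed for r in correct)
    return {
        "false_positive_rate": false_positives / len(incorrect) if incorrect else 0.0,
        "false_negative_rate": false_negatives / len(correct) if correct else 0.0,
    }
```

A high false negative rate is the easier one to miss in practice, because it shows up as the agent "failing" tasks it actually solved.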

What DeepSWE Changes

DeepSWE contains 113 original tasks across 91 open-source repositories in TypeScript, Go, Python, JavaScript, and Rust (Datacurve). Agents must explore the codebase, form a plan, and implement changes across multiple files, which is closer to what software engineering actually looks like.

One finding worth flagging: on DeepSWE, most models wrote and ran their own tests even without being asked. On SWE-Bench Pro, the same models did this far less often, because the prompt template instructed agents not to modify tests (VentureBeat). A single prompt instruction suppressed a behavior that likely improved performance. Your own prompt design can quietly limit what your coding agent does, independent of which model you chose.
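One cheap way to start auditing your own templates is a plain text scan for instructions that discourage testing. The sketch below is deliberately crude: the regular expression is a hypothetical starting point, not a list taken from DeepSWE or SWE-Bench Pro, and it will miss paraphrases.

```python
import re

# Hypothetical pattern: a negation, then a verb, then "test" or "tests"
# later in the same sentence. Extend it for your own templates.
SUPPRESSING_PATTERN = re.compile(
    r"\b(do not|don't|never)\s+(modify|edit|change|add|write|run)\b[^.\n]*\btests?\b",
    flags=re.IGNORECASE,
)


def find_suppressing_instructions(template: str) -> list[str]:
    """Return the lines of a prompt template that may discourage self-testing."""
    return [
        line.strip()
        for line in template.splitlines()
        if SUPPRESSING_PATTERN.search(line)
    ]


template = """You are a coding agent working in a git repository.
Do NOT modify the tests.
Run the tests after each change."""

print(find_suppressing_instructions(template))
```

A hit is not automatically a problem. Some teams have good reasons to protect test files. The point is to make the instruction a deliberate choice rather than a leftover.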

The Results: The Spread Matters More Than the Winner

When you reduce contamination, tighten the grading, and remove access to the reference solution, scores separate. The spread between top and bottom models on DeepSWE is roughly 65 percentage points, far wider than the same models showed on SWE-Bench Pro (Datacurve).

DeepSWE Leaderboard

Model	Pass Rate	Avg Cost	Avg Time	Output Tokens
GPT-5.5 [xhigh]	70% ± 3%	$6.61	21 min	47k
GPT-5.4 [xhigh]	56% ± 2%	$4.38	27 min	71k
Claude Opus 4.7 [max]	54% ± 5%	$18.19	39 min	103k
Claude Sonnet 4.6 [high]	32% ± 2%	$5.52	42 min	76k
GPT-5.4-mini [xhigh]	24% ± 3%	$2.08	33 min	135k
Gemini 3.1 Pro	10% ± 3%	$1.84	36 min	53k
Gemini 3 Flash	5% ± 2%	$1.53	39 min	233k

All models were run on mini-swe-agent for consistency. All figures are from the Datacurve DeepSWE leaderboard.

Cost and time matter as much as pass rate. Claude Opus 4.7 costs nearly three times as much per run as GPT-5.5 and takes almost twice as long, for a lower score. Gemini 3 Flash produces nearly five times GPT-5.5's output tokens (233k vs 47k) at a far lower pass rate. Even the top configuration fails roughly 30% of tasks on real open-source repositories. On your codebase, that number will be different, and you will not know what it is until you measure it.
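One way to fold cost and pass rate into a single number is cost per solved task: average cost per run divided by pass rate. The sketch below uses the leaderboard figures above and assumes that Avg Cost is reported per attempted task and that failed runs cost about the same as successful ones. Neither assumption is confirmed by the leaderboard, so read the output as a rough ordering rather than a precise price.

```python
# (model, pass rate, average cost per run in USD), copied from the leaderboard.
LEADERBOARD = [
    ("GPT-5.5 [xhigh]", 0.70, 6.61),
    ("GPT-5.4 [xhigh]", 0.56, 4.38),
    ("Claude Opus 4.7 [max]", 0.54, 18.19),
    ("Claude Sonnet 4.6 [high]", 0.32, 5.52),
    ("GPT-5.4-mini [xhigh]", 0.24, 2.08),
    ("Gemini 3.1 Pro", 0.10, 1.84),
    ("Gemini 3 Flash", 0.05, 1.53),
]


def cost_per_solved_task(pass_rate: float, avg_cost: float) -> float:
    """Expected spend to get one passing result, under the assumptions above."""
    return avg_cost / pass_rate


# Print models from cheapest to most expensive per solved task.
ranked = sorted(LEADERBOARD, key=lambda row: cost_per_solved_task(row[1], row[2]))
for model, pass_rate, avg_cost in ranked:
    print(f"{model:<26} ${cost_per_solved_task(pass_rate, avg_cost):6.2f} per solved task")
```

Under these assumptions GPT-5.4 comes out cheapest at roughly $7.82 per solved task, GPT-5.5 lands at roughly $9.44, and Claude Opus 4.7 at roughly $33.69. The cheapest models per run are not the cheapest per result: Gemini 3 Flash works out to about $30.60.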

The Harness Problem: What Leaderboards Don't Show

Every model in DeepSWE was run through the same scaffold, mini-swe-agent, which keeps comparisons controlled. The tradeoff is that a single-scaffold leaderboard cannot show how much harness choice changes performance in practice. A 2026 position paper documented up to 15 percentage points of scaffold-only variation on SWE-bench Verified for the same model (Zhang et al.). Head-to-head comparisons have shown even larger gaps: GPT-5.5 went from 61.5% to 87.2%, a 25.7-point jump, in a single harness swap (MindStudio). In some evaluations, the spread between harnesses exceeds the spread between models.

Model-harness pairing matters too: GPT-5.5 performs best in OpenAI's native Codex environment, and Claude Opus performs best inside Claude Code.

What Actually Moves Coding Agent Performance

Factor	Typical Performance Swing
Switching frontier models, same harness	2 to 15 percentage points
Scaffold-only variation, same model	15 to 25+ percentage points
Harness swap (GPT-5.5: Codex to Cursor)	25.7 percentage points
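If you collect your own scores across several model and harness pairings, the comparison in the table reduces to two spreads: how much a model moves across harnesses, and how much a harness moves across models. The sketch below shows the arithmetic. Only the GPT-5.5 pair (61.5 and 87.2) comes from the figures cited above. The harness labels and the model_x rows are placeholders invented for illustration.

```python
from collections import defaultdict

# Scores in percent, keyed by (model, harness).
SCORES = {
    ("GPT-5.5", "harness_a"): 61.5,   # cited figure
    ("GPT-5.5", "harness_b"): 87.2,   # cited figure
    ("model_x", "harness_a"): 58.0,   # hypothetical
    ("model_x", "harness_b"): 80.0,   # hypothetical
}


def spread_within(scores: dict, group_index: int) -> dict:
    """Max minus min score inside each group.

    group_index=0 groups by model (spread across harnesses);
    group_index=1 groups by harness (spread across models).
    """
    groups = defaultdict(list)
    for key, score in scores.items():
        groups[key[group_index]].append(score)
    return {name: round(max(vals) - min(vals), 1) for name, vals in groups.items()}


harness_effect = spread_within(SCORES, 0)  # per model, across harnesses
model_effect = spread_within(SCORES, 1)    # per harness, across models
print("Spread across harnesses:", harness_effect)
print("Spread across models:   ", model_effect)
```

With these numbers the harness effect (25.7 and 22.0 points) is several times the model effect (3.5 and 7.2 points). That is the pattern to look for in your own data before you spend time on a model bake-off.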

How to Evaluate Coding Agents for Your Own Stack

Public benchmarks are a filter, not a decision. The real evaluation happens on your code, with your harness, measured against outcomes your team actually cares about.

A Practical Evaluation Framework

Step	What to do	Why it matters
1. Filter with public benchmarks	Use DeepSWE to eliminate the bottom of the field	Narrows candidates without internal evaluation cost
2. Match the harness to your stack	Test surviving models on the tools your team uses	Harness can swing scores by 15 to 25+ points
3. Run against your own codebase	Use DeepSWE's open-source Pier harness on your own repos	Public tasks don't reflect your architecture or conventions
4. Measure production metrics	Track time to PR, acceptance rate, rework rate, cost per task	These are the numbers that matter in practice
5. Audit your prompt design	Check whether templates suppress useful agent behaviors	A single instruction can significantly cut self-testing rates

Start with bounded tasks: bug fixes, test generation, dependency upgrades, documentation updates, and small refactors. Avoid giving agents ownership over architecture, security-sensitive code, auth, or payments before evaluation is complete. DeepSWE's Pier harness is open-source, so teams can write their own tasks, define their own verifiers, and run evaluations against private codebases.
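The production metrics in step 4 are simple to compute once you log one record per agent task. Here is a minimal sketch. The AgentTask fields and the choice to measure rework only over accepted PRs are assumptions, so adapt the definitions to whatever your team already tracks.

```python
from dataclasses import dataclass
from statistics import median
from typing import Sequence


@dataclass(frozen=True)
class AgentTask:
    minutes_to_pr: float  # from task assignment to PR opened
    accepted: bool        # PR was merged
    reworked: bool        # merged PR later needed human follow-up changes
    cost_usd: float       # model and tooling spend for this task


def summarize(tasks: Sequence[AgentTask]) -> dict:
    """Compute time to PR, acceptance rate, rework rate, and cost per task."""
    if not tasks:
        raise ValueError("need at least one task to summarize")
    accepted = [t for t in tasks if t.accepted]
    return {
        "median_minutes_to_pr": median(t.minutes_to_pr for t in tasks),
        "acceptance_rate": len(accepted) / len(tasks),
        # Rework is measured over accepted PRs only (an assumption).
        "rework_rate": (
            sum(t.reworked for t in accepted) / len(accepted) if accepted else 0.0
        ),
        "cost_per_task_usd": sum(t.cost_usd for t in tasks) / len(tasks),
    }
```

Run the same summary for each model and harness pairing on the same set of bounded tasks, so the numbers are comparable.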

Internal Evaluation Pipeline for AI Coding Agents

Figure 2. The five-stage evaluation flow for selecting the right coding agent for your stack.

The Buying Decision Should Come After Your Own Evaluation

Use public benchmarks to filter, harness-aware comparisons to narrow further, then run the remaining candidates against your own codebase. Only then make the buying decision, based on how the model, harness, and workflow perform together.

References

Zhang, Y., et al. Stop Comparing LLM Agents Without Disclosing the Harness. arXiv, 2026. https://arxiv.org/html/2605.23950v1
MindStudio. Agent Harnesses Beat Model Upgrades: 5 Benchmarks That Prove the Harness Is Now the Product. MindStudio Blog, 2026. https://www.mindstudio.ai/blog/agent-harnesses-beat-model-upgrades-5-benchmarks
Schmid, P. The Importance of Agent Harness in 2026. Phil Schmid Blog, 2026. https://www.philschmid.de/agent-harness-2026
Datacurve. DeepSWE Official Leaderboard and Methodology. Datacurve AI, 2026. https://deepswe.datacurve.ai
Datacurve. DeepSWE GitHub Repository: Benchmark Tasks and Pier Harness. GitHub, 2026. https://github.com/datacurve-ai/deep-swe
VentureBeat Staff. DeepSWE Blows Up the AI Coding Leaderboard, Crowns GPT-5.5. VentureBeat, 2026. https://venturebeat.com/technology/deepswe-blows-up-the-ai-coding-leaderboard-crowns-gpt-5-5-and-finds-claude-opus-exploiting-a-benchmark-loophole
SWE-agent Team. mini-swe-agent. GitHub, 2026. https://github.com/SWE-agent/mini-swe-agent
BenchLM.ai. DeepSWE Benchmark 2026. BenchLM, 2026. https://benchlm.ai/benchmarks/deepSwe
Marc0. SWE-Bench Leaderboard. Marc0 Dev, 2026. https://www.marc0.dev/en/leaderboard
BSWEN. What Does SWE-bench Pro Reveal About Agent Scaffold Performance? BSWEN Docs, 2026. https://docs.bswen.com/blog/2026-04-20-swe-bench-pro-agent-scaffold/