SWE-Bench Verified’s Sudden Fall: How OpenAI Exposed Flaws in AI Coding’s Top Metric

OpenAI ditched SWE-bench Verified after an audit found serious test flaws in 59.4% of the problems it reviewed, along with widespread contamination across frontier models. Scores measured recall, not skill. The company now endorses the tougher SWE-bench Pro, where top models score between 23% and 59%. Private evals like GDPVal signal the future.
Written by Juan Vasquez

OpenAI has pulled the plug on SWE-bench Verified. The benchmark, once the gold standard for testing AI coding prowess, now sits discarded. Frontier models hovered around 80% for months. Progress stalled. Why? Contamination. Broken tests.

The announcement came February 23, 2026. OpenAI’s researchers, including Mia Glaese and Olivia Watkins from the Frontier Evals team, laid it bare in a detailed post on their site (OpenAI). They audited 138 problems (27.6% of the dataset) that their o3 model failed across 64 runs. Shocking result: 59.4% of those had serious flaws. Narrow tests demanded exact function names never mentioned in the task. Wide tests checked unrelated features pulled in from the original pull requests. Even humans struggled.

Take pylint-dev__pylint-4551. The fix worked. But the tests failed on an import of ‘get_annotation’, a name that appears nowhere in the task description. Or sympy__sympy-18199, where the tests checked behavior absent from the issue. OpenAI had six engineers review each case. No doubt left. “Over 60% of remaining problems are unsolvable,” Glaese explained on the Latent Space podcast (Latent Space). Forty-nine cases had tests that were too narrow. Twenty-six had tests that were too broad.
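The audit figures are easier to compare as shares of the same base. The short sketch below only reworks the numbers quoted in this article. It assumes the narrow and broad counts both refer to the 138 audited problems and do not overlap; the variable names are illustrative.

```python
# Figures quoted in the article; names are illustrative.
DATASET_SIZE = 500    # tasks in SWE-bench Verified
AUDITED = 138         # problems OpenAI audited
FLAWED_SHARE = 0.594  # share of audited problems with serious flaws
TOO_NARROW = 49       # tests that demanded unstated specifics
TOO_WIDE = 26         # tests that checked unrelated behavior


def pct(part: float, whole: float) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


audited_share = pct(AUDITED, DATASET_SIZE)   # share of the dataset audited
flawed_count = round(FLAWED_SHARE * AUDITED)  # audited problems with flaws
narrow_share = pct(TOO_NARROW, AUDITED)
wide_share = pct(TOO_WIDE, AUDITED)
# Assumption: narrow and wide cases are disjoint subsets of the flawed set.
other_flawed = flawed_count - TOO_NARROW - TOO_WIDE

print(f"audited: {audited_share}% of the dataset")
print(f"flawed: about {flawed_count} of {AUDITED} audited problems")
print(f"too narrow: {narrow_share}%, too wide: {wide_share}%")
print(f"flawed for other reasons (under the assumption above): {other_flawed}")
```

Under those assumptions, narrow and broad tests together account for most, but not all, of the flawed cases.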

And contamination? Rampant. Problems came from popular open-source Python repos like Django and Astropy, prime training fodder. Give a model nothing but a task ID and a hint. GPT-5.2 reproduces the gold patch verbatim for django__django-11451. Claude Opus 4.5 recalls file paths and comments from astropy__astropy-13236. Gemini 3 Flash outputs full diffs character-for-character on django__django-11099. Every frontier model tested could answer from memory. Scores tracked training exposure, not skill.
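A verbatim-recall check of this kind can be approximated with the standard library. The sketch below is not OpenAI's method. It is a minimal illustration that measures how much of a reference patch reappears as one contiguous run in a model's output; the function names, the 0.9 threshold, and the sample patches are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(patch: str) -> str:
    """Drop trailing whitespace and blank lines so formatting noise is ignored."""
    lines = [line.rstrip() for line in patch.splitlines()]
    return "\n".join(line for line in lines if line)


def longest_shared_run(candidate: str, gold: str) -> int:
    """Length in characters of the longest contiguous span found in both texts."""
    a, b = normalize(candidate), normalize(gold)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def looks_memorized(candidate: str, gold: str, threshold: float = 0.9) -> bool:
    """Flag outputs whose longest shared run covers most of the gold patch."""
    gold_len = len(normalize(gold))
    if gold_len == 0:
        return False
    return longest_shared_run(candidate, gold) / gold_len >= threshold


# Hypothetical patches, not taken from any real task.
gold_patch = "-    return value\n+    return value.strip()\n"
verbatim_output = "-    return value   \n\n+    return value.strip()\n"
independent_output = "+    cleaned = value.strip()\n+    return cleaned\n"

print(looks_memorized(verbatim_output, gold_patch))
print(looks_memorized(independent_output, gold_patch))
```

A high overlap is only a signal. Short or obvious fixes can match a reference patch by chance, so a real probe would need many tasks and a baseline.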

SWE-bench started strong in 2023 from Princeton researchers (arXiv). Real GitHub issues. Tests that fail before the fix. Regression tests that must still pass after it. But artifacts plagued it: environment quirks, ambiguous specs. OpenAI set out to fix that in 2024. Ninety-three engineers triple-reviewed 1,699 tasks. Down to 500 clean ones: SWE-bench Verified (OpenAI). Scores soared. Claude Opus 4 hit 72.5%. GPT-4.1 reached 54.6%. Gemini 2.5 Pro scored 63.2%, per Decrypt (Decrypt). Labs raced for bragging rights.
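The pass-fail rule described here is simple to state in code. The sketch below is a simplified illustration, not the official SWE-bench harness. It assumes a task is solved only when every previously failing test and every regression test passes after the patch; the function, parameter, and test names are hypothetical.

```python
from typing import Iterable, Mapping


def is_resolved(
    results: Mapping[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Return True only if every required test passed after the patch.

    results maps a test name to True (passed) or False (failed). A test
    missing from results is treated as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(test, False) for test in required)


# Hypothetical test names for illustration.
fail_to_pass = ["test_issue_fixed", "test_imports_unmentioned_helper"]
pass_to_pass = ["test_existing_behavior"]

# A patch that fixes the reported issue but never defines the helper
# that one hidden test imports.
working_fix = {
    "test_issue_fixed": True,
    "test_imports_unmentioned_helper": False,
    "test_existing_behavior": True,
}
# A patch that happens to match the reference implementation exactly.
reference_like_fix = {name: True for name in fail_to_pass + pass_to_pass}

print(is_resolved(working_fix, fail_to_pass, pass_to_pass))
print(is_resolved(reference_like_fix, fail_to_pass, pass_to_pass))
```

Because the rule is all-or-nothing, one overly specific test is enough to mark a correct fix as a failure, which is the failure mode the audit describes.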

But saturation hit. Over the last six months, top scores crept from 74.9% to 80.9%. Watkins noted, “Progress has kind of stalled… the eval is effectively saturated and highly contaminated.” At high levels, benchmarks test trivia like naming, not engineering. Glaese pushed for harder tasks: one to four hours of work, open-ended designs, code maintainability.

Enter SWE-bench Pro. Scale AI’s upgrade (Scale Labs). Diverse repos. GPL licenses deter training scrapes. Human-written specs keep ambiguity in check. Average 107 lines across 4 files. Scores plummet. Models that cleared 70% on Verified scored around 23% on the Pro public set. GPT-5.4 (xHigh) leads at 59.10% now. Claude Opus 4-6 thinking sits at 51.90%. Gemini 3.1 Pro thinking at 46.10%. Private sets are harder still: GPT-5 drops to 14.9%.

Industry buzz exploded. HackerRank said memorization killed it (LinkedIn/HackerRank). The Decoder saw strategy: retire the benchmark where rivals lead, reset the race on Pro (The Decoder). OpenAI Developers tweeted: “SWE-bench Verified was a strong benchmark, but we’ve found evidence it is now saturated” (X/OpenAI Devs). Sebastian Raschka said it confirmed his suspicions: the numbers made models seem closer to each other than they really are.

OpenAI looks ahead. GDPVal: private tasks written by experts and graded holistically by humans (OpenAI). It ties into the Preparedness Framework. The goal is to track real impact: workflows automated, products built. No more rigid pass-fail mirages.

But challenges persist. Recent X chatter shows Pro heating up. Claude Opus 4.7 reclaimed the top spot at 64.3%, edging out GPT-5.5’s 58.6%. DeepSeek V4 Flash nips at their heels for cost-sensitive devs. Benchmarks evolve fast. Contamination looms. Labs cycle evals like seasons. Yesterday’s north star. Today’s scrap.

Software engineers know. Benchmarks never captured production chaos: legacy code, tribal knowledge, deadlines. AI promised autonomy. Verified sold the dream. Pro tests grit. Private evals guard truth. Industry insiders watch closely. Real coding wins hide in deploys, not leaderboards.
