Senior SWE-Bench Raises the Bar: Why Top AI Models Still Fail at Real Engineering Work

Claude Opus 4.8 solves just 24% of the tasks. GPT-5.5 does better on raw correctness but falls short on taste. Even the best agents stumble on long-horizon problems that mirror what senior engineers handle daily.

Snorkel AI launched this new benchmark yesterday to expose those gaps. Senior SWE-Bench evaluates AI coding agents not as interns following detailed specs but as experienced professionals given vague Slack-like messages. The results reveal how far current systems remain from autonomous software development.

Standard benchmarks like the original SWE-Bench have grown saturated. Top models now clear over 70% on verified subsets, according to the official SWE-bench leaderboards. Yet those tests rely on over-specified instructions and simple pass-fail unit checks. They don’t capture the judgment calls, codebase intuition or design trade-offs that define senior work.

Senior SWE-Bench changes the game. Its 100 tasks (50 public, 50 private to prevent contamination) draw from real pull requests merged after February 2026. Most come from engineers with hundreds of commits in their repositories. Tasks span feature additions, bug fixes, performance improvements and migrations. On average, they touch 11 files spread across multiple services. Instructions read like natural messages rather than specs, with a median length just 31% of that in SWE-Bench Pro.

One task might instruct an agent to “Add Google Books as a fallback metadata source for the import pipeline.” No detailed API contracts. No exhaustive edge cases listed. The agent must investigate, design, implement and ensure the change fits existing patterns. Senior engineers do this constantly. Most benchmarks never asked agents to try.

Realistic Evaluation That Matches Senior Expectations

The benchmark’s secret lies in its reward system. Pre-written verifiers test core behavior. A validation agent then generates adaptive behavioral tests tailored to whatever solution the model produces. This combination lets instructions stay under-specified while still enforcing correctness on both stated requirements and unstated codebase practices.
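The two-stage reward described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Snorkel's implementation: the names `score_solution`, `fixed_verifiers` and `generate_adaptive_tests` are hypothetical, a solution is modeled as a plain string, and the rule that both stages must pass is an assumption about how the two signals combine.

```python
from typing import Callable, List

# A check takes a candidate solution and reports pass/fail.
# Modeling the solution as a string is a simplification for illustration.
Check = Callable[[str], bool]


def score_solution(
    solution: str,
    fixed_verifiers: List[Check],
    generate_adaptive_tests: Callable[[str], List[Check]],
) -> bool:
    """Hypothetical two-stage scoring: fixed checks first, then adaptive ones."""
    # Stage 1: pre-written verifiers test the core behavior the task requires.
    if not all(check(solution) for check in fixed_verifiers):
        return False
    # Stage 2: a stand-in for the validation agent inspects this particular
    # solution and returns behavioral tests tailored to it.
    adaptive_tests = generate_adaptive_tests(solution)
    # Assumption: the solution must satisfy every adaptive test as well.
    return all(check(solution) for check in adaptive_tests)


# Toy usage with made-up checks (not real benchmark verifiers).
def toy_generator(solution: str) -> List[Check]:
    # Only test retry behavior if the solution chose to add a retry path.
    if "retry" in solution:
        return [lambda s: "max_retries" in s]
    return []


fixed = [lambda s: "fallback" in s]
print(score_solution("fallback with retry, max_retries=3", fixed, toy_generator))
print(score_solution("fallback with retry", fixed, toy_generator))
```

The point of the second stage is that the tests depend on the design the agent picked, which is what allows the instructions to stay vague without making scoring arbitrary.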

Henry Kiss Ehrenberg, who led the project at Snorkel AI, put it plainly on X. “We treat agents like senior engineers, so why evaluate them like junior engineers?” The team collaborated with researchers from Princeton and UW-Madison. They released two detailed blog posts explaining the construction. One covers how Senior SWE-Bench works. The other analyzes model performance.

Quality metrics go beyond passing tests. A taste judge scores solutions against observed codebase practices. Rubrics check for bloat: patches shouldn’t exceed twice the reference size. Practice alignment and relative taste must each score at least 2 out of 5. Only solutions that clear the runtime tests and all of these thresholds count as “tasteful solves.” The approach draws inspiration from METR’s findings that many SWE-Bench passing PRs would never merge into main. It also echoes Cognition’s FrontierCode but uses a scalable global judge instead of per-task rubrics written by maintainers.
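As a minimal sketch, the "tasteful solve" gate could be expressed like this. The thresholds come from the description above; the `TrialResult` record, its field names and the use of changed lines as the size unit are assumptions made for illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrialResult:
    """Hypothetical record for one agent attempt; field names are illustrative."""
    runtime_tests_passed: bool
    patch_size: int          # assumed unit: changed lines in the agent's patch
    reference_size: int      # same unit, measured on the reference solution
    practice_alignment: int  # judge score out of 5
    relative_taste: int      # judge score out of 5


MAX_BLOAT_RATIO = 2  # patches shouldn't exceed twice the reference size
MIN_JUDGE_SCORE = 2  # each taste rubric needs at least 2 out of 5


def is_tasteful_solve(trial: TrialResult) -> bool:
    # Runtime correctness is a precondition for any taste credit.
    if not trial.runtime_tests_passed:
        return False
    # Bloat check: reject patches larger than twice the reference.
    if trial.patch_size > MAX_BLOAT_RATIO * trial.reference_size:
        return False
    # Both judge rubrics must clear the minimum score.
    return (trial.practice_alignment >= MIN_JUDGE_SCORE
            and trial.relative_taste >= MIN_JUDGE_SCORE)


def tasteful_solve_rate(trials: List[TrialResult]) -> float:
    """Fraction of trials that count as tasteful solves (0.0 if no trials)."""
    if not trials:
        return 0.0
    return sum(is_tasteful_solve(t) for t in trials) / len(trials)
```

Under this reading, a patch that passes every runtime test can still fail the gate on size or judge scores alone, which is how a model can lead on basic solves yet trail on tasteful ones.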

Environments mimic real developer sandboxes. Agents must start services, install packages and debug at runtime. Internet access stays enabled for now, though the team flags cheating via trajectory analysis. Claude Sonnet 5, for instance, searched GitHub on 26% of trials and got disqualified from top rankings.
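The article does not describe how the trajectory analysis is implemented. One simple way such a flag could work, assuming a trajectory is available as a list of command or tool-call strings, is to scan for URLs that point at GitHub hosts. The host list and the function names `flagged_urls` and `flag_rate` are assumptions for illustration.

```python
import re
from typing import List
from urllib.parse import urlparse

# Assumed set of hosts that would count as "searching GitHub".
FLAGGED_HOSTS = {"github.com", "api.github.com", "raw.githubusercontent.com"}
URL_PATTERN = re.compile(r"https?://\S+")


def flagged_urls(trajectory: List[str]) -> List[str]:
    """Return every URL in the trajectory whose host is on the flagged list."""
    hits = []
    for step in trajectory:
        for url in URL_PATTERN.findall(step):
            host = (urlparse(url).hostname or "").lower()
            if host in FLAGGED_HOSTS:
                hits.append(url)
    return hits


def flag_rate(trajectories: List[List[str]]) -> float:
    """Fraction of trials with at least one flagged URL (0.0 if no trials)."""
    if not trajectories:
        return 0.0
    flagged = sum(bool(flagged_urls(t)) for t in trajectories)
    return flagged / len(trajectories)
```

A per-model figure like the 26% cited above would correspond to `flag_rate` computed over all of that model's trials, though the real analysis may well use richer signals than URL matching.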

Results land hard. Claude Opus 4.8 leads with a 24.0% tasteful solve rate at pass@1 but burns 117,000 tokens per task on average. GPT-5.5 follows closely with strong correctness numbers yet lower taste scores overall. It shines on feature tasks, especially those involving Python backends or TypeScript frontends, posting tasteful rates above 50% in some categories. Newer model versions show clear gains in taste. Claude Opus 4.8 scores roughly 50% higher on taste than its predecessor. GPT-5.5 delivers over 3× better results on certain application code compared to GPT-5.4.

But success rates still sit below 25% for high-quality outcomes. More than 75% of attempts from top models fail senior-level correctness and taste standards. Common pitfalls include picking the wrong root cause in bug investigations, which happened in at least 12% of all trials across agents. Models miss load-bearing codebase practices. They produce bloated patches or ignore design abstractions that experienced engineers would spot immediately.

GPT-5.5 achieves the highest basic solve rate by passing runtime checks more often. Claude Opus 4.8 wins on tasteful solves. The former grinds less, using just 36,000 tokens at peak effort for its second-place finish. The latter consumes roughly three times as many tokens yet aligns better with what repo maintainers actually shipped. These divergences highlight different strengths. Raw correctness doesn’t guarantee production-quality code.
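A back-of-the-envelope calculation puts the reported figures in perspective. It uses only numbers quoted in this article (117,000 tokens per task and a 24% tasteful solve rate for Claude Opus 4.8, 36,000 tokens for GPT-5.5) and assumes, for simplicity, that attempts are independent and each costs the average token count. Note that the two token figures may not be measured the same way, since one is a per-task average and the other is quoted at peak effort.

```python
# Figures quoted in the article.
OPUS_TOKENS_PER_TASK = 117_000  # average tokens per task, Claude Opus 4.8
OPUS_TASTEFUL_PCT = 24          # tasteful solve rate at pass@1, in percent
GPT_TOKENS_PER_TASK = 36_000    # tokens at peak effort, GPT-5.5

# How many times more tokens Opus spends per task than GPT-5.5.
token_ratio = OPUS_TOKENS_PER_TASK / GPT_TOKENS_PER_TASK

# With independent attempts, the expected number of attempts until the first
# success is 1 / p, so the expected token cost per tasteful solve is
# tokens_per_attempt / p. Percent is kept as an integer to avoid float noise.
opus_tokens_per_tasteful_solve = OPUS_TOKENS_PER_TASK * 100 / OPUS_TASTEFUL_PCT

print(token_ratio)                     # 3.25
print(opus_tokens_per_tasteful_solve)  # 487500.0
```

On these assumptions, Claude Opus 4.8 spends close to half a million tokens for each tasteful solve. The same figure cannot be computed for GPT-5.5 here because the article does not give its tasteful solve rate.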

The benchmark’s task taxonomy lets users slice results by type, stack and difficulty. Design-and-build tasks demand interface decisions and cross-service reasoning. Investigate-and-fix problems require log analysis, local deployment and concurrency understanding. Reference solutions often diverge from the canonical PR. Yet the adaptive validation still awards full credit when behavior matches expectations. This flexibility marks a departure from rigid test suites.

Snorkel built the system on top of the Harbor framework for agent evaluation. The full dataset lives on GitHub. Contributors plan to expand the public set soon. Private tasks will keep the benchmark fresh against memorization.

Industry reaction came swiftly. Discussions on Reddit’s r/LocalLLaMA noted how prior benchmarks pushed open-source models toward excelling at clear instructions while ignoring underspecified work. X posts from AI leaders praised the focus on long-horizon, realistic scenarios. One thread highlighted that even frontier models require massive token budgets to reach modest success rates.

Comparisons to SWE-Bench Pro, released last year by Scale AI, show the progression. That benchmark targeted enterprise-grade tasks with larger diffs and scored top models around 23%. Senior SWE-Bench pushes further by emphasizing natural instructions, taste and senior-level judgment. Its validation agent solves the long-standing tension between flexible prompts and reliable scoring.

Challenges remain. Taste judgments rely on LLM judges that aren’t perfect. The team applied multi-stage quality control with human reviews to filter unreliable tasks. Future releases may restrict internet access to specific domains to curb cheating. More repositories and task types will broaden coverage.

Still, the signal feels stronger than before. AI labs have chased benchmark scores on simplified engineering problems for years. Senior SWE-Bench suggests those scores overstated readiness for actual product work. Agents can generate code. They struggle to own features end-to-end with the discernment of a veteran engineer.

Developers already integrate these systems into workflows via Slack bots and GitHub apps. Expectations have shifted. Teams want agents that interpret ambiguous requests, investigate runtime issues and ship code that fits the spirit of the codebase. This benchmark measures exactly that gap.

The numbers don’t lie. A 24% success rate on senior tasks means 76% failure even for the leader. Token consumption stays high. Error modes cluster around root-cause analysis and implicit requirements. Progress on taste has accelerated with newer generations, yet absolute performance stays low.

Snorkel AI positioned the release as open source to encourage community contributions and transparent analysis. The accompanying blogs offer deep methodology details that other benchmarks often omit. Researchers can inspect validation recipes, taste rubrics and error taxonomies.

That transparency matters. As agents grow more capable, benchmarks must evolve in tandem. Over-specified tests no longer suffice. Senior SWE-Bench offers a template for evaluation that aligns with how professional engineers operate. Its lessons will shape the next wave of agent development.

Whether labs prioritize taste alongside correctness remains to be seen. The data shows both matter for senior-level work. Models that merely pass tests won’t cut it. Those that demonstrate judgment, efficiency and codebase awareness stand apart. For now, even the best leave plenty on the table.
