The Machines Are Writing the Code Now — And a New Benchmark Finally Measures How Well They Do It

A quiet but significant shift is underway in software development. AI coding assistants — tools like GitHub Copilot, Cursor, and a growing roster of competitors — have moved from novelty to near-ubiquity in engineering teams. But a persistent question has dogged the industry: how do you actually measure whether these tools are any good?

A new project called AI Code Benchmarks, built and maintained by developer Josh Swerdlow, is attempting to answer that question with a degree of rigor that most vendor-published benchmarks lack. The site aggregates and presents results from standardized coding evaluations — SWE-bench, LiveCodeBench, Aider Polyglot, and others — offering a single destination where engineers and engineering leaders can compare AI coding models head-to-head across multiple dimensions of real-world performance.

It’s not a small undertaking. And the timing matters.

The AI-assisted coding market has exploded. GitHub reported in February 2025 that Copilot now has more than 15 million users, a figure that has roughly tripled since early 2024. Cursor, the AI-native code editor built by Anysphere, raised funding at a $900 million valuation in late 2024 according to Reuters, and has since reportedly seen that figure climb further. Amazon’s CodeWhisperer, now rebranded as part of Amazon Q Developer, is being pushed aggressively across AWS’s enterprise customer base. Google’s Gemini models are being embedded directly into development workflows through Firebase and Android Studio.

Money is pouring in. But so is confusion.

Every model provider publishes benchmarks. Anthropic touts Claude’s performance on SWE-bench Verified. OpenAI highlights the coding capabilities of GPT-4o and o3. Google points to Gemini 2.5 Pro’s scores. The problem is that these benchmarks are often cherry-picked, run under favorable conditions, or presented without the context needed to make meaningful comparisons. Swerdlow’s project attempts to correct for this by pulling from established, third-party benchmark suites and presenting results in a way that lets users compare models on equal footing.

The benchmarks tracked on the site tell different stories about different capabilities. SWE-bench, developed by researchers at Princeton, tests whether an AI model can resolve real GitHub issues from popular open-source Python repositories. It’s a demanding test. Models must understand the codebase, identify the relevant files, and produce a working patch. SWE-bench Verified, a curated subset, filters out ambiguous or poorly specified issues, providing a cleaner signal. As of mid-2025, top-performing models on SWE-bench Verified are solving around 60-70% of issues — impressive, but far from complete autonomy.
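To make the headline percentage concrete, here is a minimal sketch of how a resolution rate of this kind can be computed. It assumes an issue counts as resolved only when the tests targeted by the fix pass and the previously passing tests still pass. The class, field names, and sample results are hypothetical and are not taken from the SWE-bench harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IssueResult:
    """Outcome of applying a model's patch to one issue (hypothetical schema)."""
    issue_id: str
    target_tests_pass: bool    # tests the fix is supposed to make pass
    existing_tests_pass: bool  # tests that passed before and must not break


def is_resolved(result: IssueResult) -> bool:
    """An issue counts as resolved only if the fix works and nothing regresses."""
    return result.target_tests_pass and result.existing_tests_pass


def resolution_rate(results: list[IssueResult]) -> float:
    """Fraction of issues resolved, in the range 0.0 to 1.0."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Made-up results for illustration only.
results = [
    IssueResult("issue-1", True, True),
    IssueResult("issue-2", True, False),   # fix works but breaks other tests
    IssueResult("issue-3", False, True),   # patch does not fix the issue
    IssueResult("issue-4", True, True),
]
rate = resolution_rate(results)
print(f"{rate:.0%}")  # 50%
```

A patch that fixes the reported problem but breaks something else scores zero for that issue, which is part of why the test is demanding.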

LiveCodeBench takes a different approach. It draws from competitive programming platforms (LeetCode, Codeforces, AtCoder), using problems posted after the training data cutoffs of the models being tested. This matters enormously. One of the persistent criticisms of AI coding benchmarks is data contamination: the possibility that models have seen the test problems during training. LiveCodeBench sidesteps this by using only recent problems, so a model is scored on problems it cannot have memorized from its training data.
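The date-based filtering idea can be sketched in a few lines. This is an illustration of the principle only, not LiveCodeBench’s actual code; the `Problem` class, the problem IDs, and the cutoff date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    published: date  # date the problem first appeared publicly


def uncontaminated(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.published > training_cutoff]


# Made-up problems and cutoff for illustration only.
problems = [
    Problem("p1", date(2024, 1, 15)),
    Problem("p2", date(2024, 6, 1)),
    Problem("p3", date(2024, 9, 30)),
]
cutoff = date(2024, 4, 1)  # hypothetical training cutoff for some model

fresh = uncontaminated(problems, cutoff)
print([p.problem_id for p in fresh])  # ['p2', 'p3']
```

Because each model has its own cutoff, the set of eligible problems differs from model to model, which is worth keeping in mind when comparing scores.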

Then there’s the Aider Polyglot benchmark, which measures performance across multiple programming languages — not just Python, which dominates most other evaluations. This is where some models that look dominant in Python-only tests start to show cracks. The real world of software engineering isn’t monolingual. Codebases mix Python, TypeScript, Go, Rust, Java, and more. A model that excels at Python but stumbles on TypeScript has a real limitation that single-language benchmarks won’t reveal.

Swerdlow’s site doesn’t editorialize much. It presents the data. But the data itself tells a story that’s more nuanced than the marketing materials from any single AI company would suggest.

Consider the current state of play. Anthropic’s Claude 3.5 Sonnet and the newer Claude Sonnet 4 have been consistently strong performers across multiple benchmarks, particularly on SWE-bench tasks that require understanding large codebases and producing multi-file patches. OpenAI’s models, particularly the o3 and o4-mini reasoning models, show exceptional strength on competitive programming tasks but sometimes lag on the more practical, real-world issue resolution tasks that SWE-bench measures. Google’s Gemini 2.5 Pro has emerged as a serious competitor, posting strong numbers across the board and particularly excelling in longer-context coding tasks where its large context window provides an advantage.

But no model dominates everything. That’s the real takeaway.

The fragmentation of performance across benchmarks reflects a deeper truth about AI-assisted coding: the task space is enormous. Writing a function to solve an algorithmic puzzle is fundamentally different from debugging a race condition in a distributed system. Generating boilerplate CRUD endpoints is different from refactoring a legacy codebase with poor documentation. No single benchmark captures all of this, which is precisely why aggregation sites like Swerdlow’s have value.
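One way to check the “no model dominates” claim against an aggregated table is a simple dominance test: a model dominates another only if it is at least as good on every benchmark and strictly better on at least one. The sketch below uses made-up model names and scores and assumes every model has a score for every benchmark.

```python
Scores = dict[str, float]  # benchmark name -> score (higher is better)


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def undominated(table: dict[str, Scores]) -> list[str]:
    """Models that no other model beats across the board."""
    return sorted(
        name
        for name, scores in table.items()
        if not any(
            dominates(other, scores)
            for other_name, other in table.items()
            if other_name != name
        )
    )


# Made-up scores for illustration only.
table = {
    "model_a": {"issue_resolution": 65.0, "competitive": 50.0, "polyglot": 60.0},
    "model_b": {"issue_resolution": 55.0, "competitive": 70.0, "polyglot": 58.0},
    "model_c": {"issue_resolution": 50.0, "competitive": 45.0, "polyglot": 55.0},
}
print(undominated(table))  # ['model_a', 'model_b']
```

When more than one model survives this filter, as here, the choice between them depends on which benchmark best resembles the work a team actually does.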

The industry is starting to grapple with this complexity in other ways too. The Wall Street Journal reported earlier this year on the growing anxiety among software engineers about AI’s impact on their profession. Junior developer hiring has slowed at several major tech companies, with some executives privately attributing the shift to productivity gains from AI coding tools. But senior engineers — the ones who can effectively prompt, review, and integrate AI-generated code — are in higher demand than ever.

This dynamic makes benchmark literacy a professional skill. An engineering manager evaluating whether to adopt Cursor versus GitHub Copilot versus a custom integration with Claude’s API needs to understand what the benchmarks actually measure and where they fall short. SWE-bench scores don’t tell you how well a model integrates with your IDE. LiveCodeBench scores don’t predict how effectively a model will handle your company’s proprietary framework. Aider Polyglot numbers don’t capture the experience of pair-programming with an AI assistant in real time.

Context matters. Always.

There’s also the question of cost. The AI Code Benchmarks site tracks performance but doesn’t currently incorporate pricing data — and in the enterprise market, cost-per-token differences between models can translate to hundreds of thousands of dollars annually at scale. OpenAI’s o3 model, for instance, is significantly more expensive per token than GPT-4o. Claude 3.5 Sonnet is cheaper than Claude Opus but often performs comparably on coding tasks. For many organizations, the relevant question isn’t “which model scores highest?” but “which model provides the best performance per dollar for our specific use case?”
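The performance-per-dollar question is simple arithmetic once usage and pricing are known. The sketch below uses entirely hypothetical prices, scores, and token volumes, and it simplifies by using a single blended price per million tokens rather than separate input and output rates.

```python
def annual_cost_usd(tokens_per_year: float, price_per_million_tokens: float) -> float:
    """Yearly spend given total token volume and a blended per-million-token price."""
    return tokens_per_year / 1_000_000 * price_per_million_tokens


def points_per_thousand_dollars(score: float, annual_cost: float) -> float:
    """Benchmark points obtained per $1,000 of annual spend."""
    return score / (annual_cost / 1000)


# All numbers below are hypothetical, chosen only to show the calculation.
tokens_per_year = 20_000_000_000  # 20 billion tokens across an organization
model_x = {"score": 68.0, "price": 15.0}  # higher score, higher price
model_y = {"score": 62.0, "price": 3.0}   # slightly lower score, lower price

cost_x = annual_cost_usd(tokens_per_year, model_x["price"])
cost_y = annual_cost_usd(tokens_per_year, model_y["price"])

print(f"{cost_x:,.0f}")  # 300,000
print(f"{cost_y:,.0f}")  # 60,000
print(f"{points_per_thousand_dollars(model_x['score'], cost_x):.2f}")  # 0.23
print(f"{points_per_thousand_dollars(model_y['score'], cost_y):.2f}")  # 1.03
```

In this made-up case a six-point score advantage costs an extra $240,000 a year. Whether that is worth paying depends on what those extra points mean for the team’s actual work.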

The open-source world adds another dimension. Meta’s Llama models, DeepSeek’s coding-focused models, and Alibaba’s Qwen series have all posted competitive benchmark numbers in recent months. The Verge covered Meta’s aggressive push into AI coding tools, noting that Llama models can be self-hosted — eliminating the per-token cost entirely for organizations willing to invest in infrastructure. For companies with strict data privacy requirements, this isn’t just a cost advantage. It’s a prerequisite.

Swerdlow’s project sits at an interesting intersection. It’s not a commercial product. It’s not backed by a model provider. It’s an independent effort to bring transparency to a market that desperately needs it. The site is cleanly designed, regularly updated, and — critically — open about its methodology. Each benchmark’s description includes what it measures, how it’s scored, and what its limitations are.

This kind of independent benchmarking has precedent in other areas of technology. AnandTech and Tom’s Hardware built reputations doing exactly this for CPUs and GPUs. MLPerf became the standard for comparing machine learning hardware. In the AI coding space, the equivalent infrastructure is still being built. Academic benchmarks like SWE-bench provide the raw evaluation frameworks, but the presentation and contextualization layer, which makes the results accessible and comparable for practitioners, has been largely missing.

That gap is what AI Code Benchmarks fills.

The stakes are high and getting higher. Gartner estimates that by 2028, 75% of enterprise software engineers will use AI coding assistants, up from less than 10% in early 2023. McKinsey’s research suggests that AI tools can improve developer productivity by 20-45% on certain tasks, though the variance is wide and heavily dependent on the nature of the work. These numbers are driving massive enterprise purchasing decisions. CIOs are signing multi-year contracts with AI platform providers. And they’re doing so in an environment where the models themselves are improving — and changing — on a quarterly basis.

A benchmark score from January may be irrelevant by June. A model that leads in one category may be surpassed in the next release cycle. The pace of improvement is relentless, and it makes static, point-in-time evaluations dangerous if treated as permanent truths.

So what should engineering leaders actually do with this information?

First, resist the temptation to optimize for a single benchmark. A model’s SWE-bench score tells you something real, but it doesn’t tell you everything. Test models against your own codebases, your own workflows, your own team’s patterns of interaction. Second, pay attention to the trajectory of improvement, not just the current snapshot. Some model families are improving faster than others, and betting on a model with a strong improvement curve may be smarter than choosing the current leader. Third, treat benchmark aggregation sites like Swerdlow’s as starting points for evaluation, not endpoints.
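The second point, trajectory over snapshot, can be approximated by fitting a straight line through a model family’s scores over time and comparing slopes. The following is a rough sketch with made-up data; a linear fit over a handful of releases is a crude signal, not a forecast.

```python
def improvement_slope(points: list[tuple[float, float]]) -> float:
    """Least-squares slope of score against time (score points per month).

    Each point is (months_since_first_release, benchmark_score).
    Requires at least two points with distinct time values.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    denominator = sum((x - mean_x) ** 2 for x, _ in points)
    return numerator / denominator


# Made-up release histories for two hypothetical model families.
family_a = [(0, 40.0), (6, 50.0), (12, 60.0)]  # currently slightly behind
family_b = [(0, 55.0), (6, 58.0), (12, 61.0)]  # currently slightly ahead

slope_a = improvement_slope(family_a)
slope_b = improvement_slope(family_b)
print(f"{slope_a:.2f}")  # 1.67
print(f"{slope_b:.2f}")  # 0.50
```

Here family B leads today by one point, but family A has been gaining more than three times as fast, which is the kind of pattern a single snapshot hides.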

The AI coding market is moving fast. Too fast for any single evaluation to remain definitive for long. But projects like AI Code Benchmarks provide something valuable: a common frame of reference. A place where the claims of model providers can be checked against standardized, independent evaluations. In a market flooded with hype and vendor-driven narratives, that’s not a luxury. It’s a necessity.

And for the engineers writing code alongside these AI tools every day, understanding the strengths and weaknesses of the models they depend on isn’t academic. It’s the difference between a tool that accelerates their work and one that generates plausible-looking bugs they’ll spend hours tracking down.

The machines are writing code now. Knowing which machines write it best — and under what conditions — has become one of the most consequential questions in software engineering.
