Experienced software developers walked into a rigorous experiment expecting artificial intelligence to slash their task times. Instead, the tools stretched completion times by 19%, even as participants insisted they felt faster. This counterintuitive outcome from a 2025 study by the nonprofit Model Evaluation and Threat Research (METR) has ignited debate among technologists about AI’s true role in professional coding.
The METR trial, detailed in a paper on arXiv and summarized on the group’s blog, recruited 16 developers averaging five years’ experience on massive open-source repositories, each with over 22,000 stars and a million lines of code. They tackled 246 real tasks, such as bug fixes and features, drawn from their daily workflows. Tasks were randomly assigned: on roughly half, developers could use AI tools such as Cursor Pro paired with Claude 3.5 or 3.7 Sonnet; on the rest, they worked unassisted. Developers predicted a 24% speedup from AI beforehand and self-reported 20% gains afterward. Reality diverged sharply.
Joel Becker and Nate Rush, the METR researchers leading the effort, were stunned. “The majority of developers who participated in the study noted that even when they get AI outputs that are generally useful… these developers have to spend a lot of time cleaning up the resulting code to make it actually fit for the project,” Rush told Fortune. Time logs and screen recordings captured the drag: prompting, waiting on generations, debugging hallucinations, and retrofitting AI output into familiar codebases consumed more time than the suggestions saved.
Perception Gap Fuels Overoptimism
This mismatch between feeling and fact echoes broader patterns. Participants overestimated AI’s edge despite instructions to use it only when helpful. One developer, Philipp Burckhardt, reflected in a blog post: “While I like to believe that my productivity didn’t suffer while using AI for my tasks, it’s not unlikely that it might not have helped me as much as I anticipated or maybe even hampered my efforts.” METR authors noted developers split between normal use, experimentation, and maximal reliance—modes that amplified delays on complex, context-heavy work.
Greater slowdowns hit tasks where developers had deep prior knowledge, per the study. AI struggled with project-specific quirks that veterans navigated intuitively. “Developers have goals other than completing the task as soon as possible,” Becker explained to Reuters, which helps explain why many, including the study’s authors, keep using the tools for a smoother, essay-editing-like workflow despite the metrics.
Clashing Studies Reveal Context’s Power
METR’s findings clash with earlier, more upbeat results. A GitHub-Microsoft experiment saw developers finish a JavaScript HTTP server 55.8% faster with Copilot, per a 2023 arXiv paper. Larger field trials at Microsoft, Accenture, and a Fortune 100 firm yielded 26% more completed tasks with Copilot, as reported in an MIT Sloan analysis, with gains strongest for junior developers (27-39%) versus seniors (8-13%).
Why the split? METR emphasized real, mature projects versus benchmarks or greenfield tasks. “On the surface, METR’s results seem to contradict other benchmarks… but those often also measure productivity in terms of total lines of code or the number of discrete tasks,” noted Ars Technica. A Qodo study likewise found verification overhead undercutting gains, while Danish workforce data showed bumps of just 3%, per The Register.
Cognitive costs compound: developers spent 34.3% of their sessions just verifying Copilot suggestions, per Google DeepMind’s Paige Bailey, citing internal visuals on X. Review fatigue and context-switching, flagged in InfoWorld, mirror American Psychological Association findings on task-switching penalties.
Economic Hype Meets Measured Reality
Big forecasts falter under scrutiny. PwC projected a 15% lift to U.S. GDP by 2035; Goldman Sachs, a 25% productivity surge. Yet MIT reported that just 5% of 300 AI deployments were rapidly accelerating revenue, per Harvard Business Review via Fortune. University of Chicago’s Anders Humlum told Fortune: “In the real world, many tasks are not as easy as just typing into ChatGPT. Many experts have a lot of experience [they’ve] accumulated that is highly beneficial.”
METR cautioned about limits: a small sample, tools that were new to many participants (56% had never used Cursor), and no junior developers or unfamiliar codebases. One veteran with more than 50 hours of Cursor experience did speed up, hinting that sustained practice may change the picture. MIT’s Daron Acemoglu estimates AI can usefully aid only about 4.6% of U.S. work tasks. Discussions on X, such as Vladimir’s post, contrast Copilot’s 55% lab wins with METR’s 19% drag and ask why team sizes haven’t shrunk amid “1000x engineer” hype.
Recent data tempers expectations further. A Science study found AI-generated code accounted for 29% of U.S. coding functions by early 2025, boosting productivity by an estimated 3.6%, mostly for experts experimenting boldly, per TechXplore. Anthropic trials showed Claude aiding tasks but eroding skills and collaboration.
Workflow Shifts Demand New Guardrails
Industry insiders are adapting. “AI coding tools introduced ‘extra cognitive load and context-switching’ that disrupted developer productivity,” per Augment Code’s analysis of the METR results. Cerbos urged teams to measure times rather than perceptions, citing the 43-point gap between the 24% speedup developers expected and the 19% slowdown they actually experienced. Threads on Reddit’s r/programming lament that fixing AI’s subtle errors can take longer than writing the code by hand.
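To make “measure times rather than perceptions” concrete, here is a minimal sketch in Python of comparing logged completion times against self-reported gains. The task log, field names, and numbers are illustrative assumptions for this article, not METR’s data, analysis code, or any vendor’s tooling.

# Illustrative sketch only: a hypothetical task log, not METR's dataset or analysis code.
tasks = [
    # (used_ai, minutes_logged, self_reported_speedup_pct)
    (True, 119, 20),
    (False, 100, 0),
    (True, 130, 25),
    (False, 110, 0),
]

ai_minutes = [m for used, m, _ in tasks if used]
manual_minutes = [m for used, m, _ in tasks if not used]
perceived = [p for used, _, p in tasks if used]

# A ratio above 1.0 means tasks took longer with AI, however fast it felt.
measured_ratio = (sum(ai_minutes) / len(ai_minutes)) / (sum(manual_minutes) / len(manual_minutes))
print(f"Measured AI/manual time ratio: {measured_ratio:.2f}")
print(f"Average self-reported speedup: {sum(perceived) / len(perceived):.0f}%")

The measured ratio comes from clocks rather than recollection, which is the kind of objective signal the perception-gap findings suggest teams should privilege.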
Training matters: live Cursor sessions didn’t erase the slowdowns, but sustained practice might. The tools are also evolving; METR’s own tracking shows the length of tasks AI can handle doubling roughly every seven months. X user @tekbog warned that gambling on AI output dumbs developers down; Birgitta Böckeler, writing on Martin Fowler’s site, highlighted amplified bad practices and fatigue.
For leaders, the lessons crystallize: pair AI selectively with junior developers or novel code; enforce rigorous reviews; track objective metrics. As Reuters put it, the gains don’t blanket all scenarios, least of all veterans working in their own domains. Developers keep using Cursor for the enjoyment of the workflow, not just speed, reshaping priorities beyond raw output.

