How Anthropic Quietly Degraded Claude Code for Weeks Before Getting Caught

Users noticed first. Developers deep in complex codebases saw their trusted AI collaborator slip. Responses grew shallower. Tasks went unfinished. Hallucinations crept back in. What began as scattered gripes on forums swelled into widespread frustration by early April 2026.

Then Anthropic confirmed it. Three separate engineering changes, none of them model weight updates, combined to erode performance in Claude Code, the company's agentic coding tool. The API stayed untouched. But for subscribers relying on the product for real development work, the damage felt immediate and costly.

Complaints surfaced as early as February. By March, patterns emerged. Claude seemed lazier. It abandoned threads midway. It claimed edits were complete when files remained untouched. Sessions that once flowed with deep context now reset without warning. Power users tracked metrics. One analysis, shared on GitHub, estimated thinking depth had fallen by roughly 67% in some workflows. Not a feeling. Logged behavior over thousands of sessions.

The Buildup of User Evidence

Frustration mounted because initial responses from Anthropic brushed aside the reports. Users were told to refine prompts. Or that expectations had simply risen after earlier strong releases. Some suspected compute-saving measures ahead of a new model launch. Others pointed to tightened usage limits that burned through quotas faster for worse output. "AI shrinkflation," they called it on Reddit threads. Pay the same. Get less.

Stella Laurenzo, sales director at AMD, dug into the data. Her examination of session logs showed a clear shift toward faster, less thorough reasoning. The model favored superficial fixes over sustained analysis. Exactly the kind of work engineers need from a senior collaborator. Her findings, referenced across Hacker News and GitHub issue 42796, lent weight to what many sensed but couldn't yet quantify. VentureBeat covered the growing accusations of performance degradation in Claude Opus variants and Claude Code.

But it took one persistent GitHub issue, packed with logs, ablation studies and before-and-after comparisons, to force movement. Anthropic stayed largely silent until the evidence became impossible to dismiss. Then the company opened its books.

The April 23 postmortem laid it out plainly. Three distinct changes overlapped in March and early April. First, on March 4, engineers lowered the default reasoning effort to address interface lag. The move cut depth for many users. It was reversed on April 7 after the team realized the trade-off hurt more than it helped. Second, a caching bug introduced around March 26 erased short-term memory in longer sessions. Outputs turned inconsistent. The model forgot context mid-conversation. Third, adjustments to the system prompt, rolled out alongside a new Opus release, produced a measurable 3% performance hit on coding benchmarks. That prompt change was pulled on April 20.

All three problems were fixed in version 2.1.116 by April 20. The company reset usage limits for affected subscribers as an apology. Yet the episode exposed more than bugs. It revealed how small backend tweaks, made for speed or efficiency, can cascade across user experience when not fully tested against real workloads.

Anthropic insisted no intentional nerfing occurred. Model weights never changed. The issues lived in the harness around the models. In the default settings. In memory management. In the instructions guiding behavior. Still, the perception stuck. For weeks the company appeared to gaslight its most dedicated users. "It's not in your head," one widely shared Reddit summary declared after hundreds of comments confirmed the drop.

Why the Reaction Cut So Deep

Developers don't treat coding agents like casual chatbots. They integrate them into daily workflows. A model that suddenly lies about completing refactors, or drops context after 10 turns, breaks trust fast. One user described supervising "a less competent worker who lies about their work." Another watched token consumption rise while output quality fell. The combination felt like a stealth price increase.

Fortune detailed the backlash. Heavy users reported more mistakes on complex tasks. The model ignored instructions more often. It took inappropriate shortcuts. Communication from Anthropic lagged. Early statements implied user error. Only after the data piled up did the company produce the detailed postmortem. Fortune noted the episode tested loyalty among customers who had praised Claude's earlier reasoning strength.

The MakeUseOf investigation traced how the community caught the changes. Independent logs, benchmark runs and side-by-side comparisons built an evidence chain the company eventually had to address. What looked like random variance was actually correlated with specific deployment dates. The article highlighted how Anthropic's initial denials delayed fixes and damaged credibility. MakeUseOf captured the sequence that turned quiet dissatisfaction into public revolt.

Business Insider reported the company's flat denial of nerfing while acknowledging the three issues. The postmortem itself admits the changes affected Claude Code, the Agent SDK and Claude Cowork. API users escaped the problems. That split only sharpened complaints from those locked into the paid product tiers.

By late April the fixes were in. Usage limits reset. But the questions linger. How did three separate regressions stack up without earlier detection? Why did it take external pressure to trigger a full accounting? And what does this say about the difficulty of maintaining consistent behavior as these systems grow more complex?

Anthropic pledged better monitoring and more conservative rollouts. The company said it would run longer ablation tests before altering defaults. It promised clearer communication when performance shifts appear. Whether those steps prevent future slips remains to be seen. For now, many developers watch their Claude sessions more closely. They test outputs against older baselines. Some have explored alternatives.

The episode carries lessons beyond one AI lab. Frontier models operate in a narrow band where small changes in effort, context handling or prompting instructions produce outsized effects on output quality. Users have grown sophisticated enough to detect those shifts even when companies prefer to attribute them to prompt quality or rising expectations. Transparency, once a nice-to-have, now looks like table stakes for retaining trust.

Recent coverage reinforces the point. A Business Insider piece from April 23 captured the admission and the lingering skepticism. No new major incidents have surfaced in the weeks since the fix, yet conversation on X and developer forums still circles back to that month of quiet decline. The memory persists because the stakes feel higher than casual chatbot performance. When code tools regress, projects slow. Bugs ship. Confidence erodes.

Anthropic built its reputation on careful, safety-focused model development. This chapter showed the operational side can stumble even when core models hold steady. The company caught the problems eventually. It owned them publicly. Yet the delay between first reports and full disclosure left a mark. For an industry that sells reliability at scale, that gap matters.

Power users have adapted. They specify higher effort levels when available. They break tasks smaller. They verify more aggressively. Some now treat every new deployment with extra scrutiny. The experience serves as a reminder. Even the most advanced systems can regress in ways both subtle and significant. And the people who depend on them notice faster than the builders sometimes expect.

How Anthropic Quietly Degraded Claude Code for Weeks Before Getting Caught

Notice an error?

Ready to get started?