Anthropic's Fable 5 Crushed GPT-5.5 on Benchmarks. Then Regulators Pulled It.

Anthropic's Fable 5 led GPT-5.5 on most major benchmarks, including 80.3% on SWE-Bench Pro versus 58.6%, and topped leaderboards for three days. A US export control order then removed it, leaving GPT-5.5 as the strongest widely available option, with lower costs and narrower capability gaps on some agentic tasks. Enterprises now balance performance against price and policy.

Anthropic dropped Claude Fable 5 on June 9. For three days it sat atop leaderboards, posting scores that left OpenAI’s GPT-5.5 well behind on software engineering, agentic tasks and complex reasoning. Then an export control order from the US government took it offline. Developers who had just begun testing the new Mythos-class model suddenly faced a choice: revert to GPT-5.5 or to earlier Claude versions. The gap they had seen was not marginal.

Fable 5 scored 80.3% on SWE-Bench Pro. GPT-5.5 managed 58.6%. The 22-point difference translated into real productivity gains for teams handling large codebases. Stripe compressed a migration across a 50-million-line Ruby codebase from months of manual effort into a single day. Other companies including Cursor, Replit, Figma and Hebbia reported similar jumps in autonomous coding and knowledge work. The model handled repository-level issues, multi-step debugging and long-horizon planning with fewer errors.

But raw numbers only tell part of the story. On FrontierCode Diamond, a test focused on production-quality, maintainable code, Fable 5 reached 29.3%. GPT-5.5 scored 5.7%. The lead held on GDPval-AA, where Fable 5 hit 1932 against GPT-5.5’s 1769. Artificial Analysis gave Fable 5 an Intelligence Index of 65. GPT-5.5 scored 60. Fable 5 also led the Arena leaderboard outright, with Claude Opus 4.8 and 4.7 variants in second and third. GPT-5.5 placed fourth.

Benchmark gaps reveal workflow trade-offs

Some tasks favored GPT-5.5. Terminal-Bench results showed narrower differences. GPT-5.5 posted strong marks on interactive terminal commands and certain agentic flows. It also carried a clear price advantage. At $5 per million input tokens and $30 per million output, it ran roughly 40-50% cheaper than Fable 5’s $10 and $50 rates, depending on the mix of input and output tokens. Enterprises with high-volume, cost-sensitive applications noticed. Speed and consistency in execution mattered too. Several developers reported GPT-5.5 felt more direct on clean algorithmic problems while Fable 5 excelled at planning across messy, interdependent systems.
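How much cheaper GPT-5.5 runs depends on the ratio of input to output tokens in a given workload. The sketch below works through that arithmetic using the per-million-token rates quoted above. The `Pricing` class, the `savings_fraction` helper and the 3:1 example workload are illustrative assumptions, not part of either vendor's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token list prices in US dollars (figures quoted in the article)."""
    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Total dollar cost for a workload of the given token counts."""
        return (
            input_tokens * self.input_per_million
            + output_tokens * self.output_per_million
        ) / 1_000_000


GPT_5_5 = Pricing(input_per_million=5.0, output_per_million=30.0)
FABLE_5 = Pricing(input_per_million=10.0, output_per_million=50.0)


def savings_fraction(cheaper: Pricing, pricier: Pricing,
                     input_tokens: int, output_tokens: int) -> float:
    """Fraction of spend saved by running the workload on `cheaper`."""
    return 1 - (
        cheaper.cost(input_tokens, output_tokens)
        / pricier.cost(input_tokens, output_tokens)
    )


if __name__ == "__main__":
    # Hypothetical workload: 3 million input tokens for every 1 million output tokens.
    saved = savings_fraction(GPT_5_5, FABLE_5, 3_000_000, 1_000_000)
    print(f"{saved:.2%}")  # prints 43.75%
```

An input-only workload and an output-only workload give the two extremes of the saving, so any real mix falls between them.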

And the safety debate added another layer. The government cited a jailbreak vulnerability. Anthropic countered that the issue was minor, already public and reproducible on GPT-5.5 without special techniques. The company noted that over 1,000 hours of red-teaming had found no universal bypasses. More than 95% of Fable 5 sessions stayed self-contained. High-risk queries on cybersecurity, biology or chemistry were routed to Opus 4.8 with its tighter safeguards. The model carried a 30-day retention policy and avoided training on user data from Mythos-class interactions.

Real user feedback split along task lines. Reddit threads and developer discussions showed many preferred Fable 5 for embedded work, reverse engineering and complex Rust projects. It caught downstream effects that GPT-5.5 sometimes missed. Others found GPT-5.5 more consistent on execution and less prone to overthinking simple prompts. One tester described Fable 5 as a step-function improvement in agentic performance. Another called the difference noticeable yet not life-changing in ordinary chat.

The brief availability window created unusual market dynamics. Anthropic offered Fable 5 at no extra cost to Pro, Max, Team and Enterprise subscribers in a promotion set to run through June 22. In practice it lasted only until June 12. Teams that integrated it quickly reported tangible output lifts on long-context vision tasks and persistent memory benchmarks. It beat prior records on Pokémon FireRed with minimal scaffolding and improved spatial reasoning scores. Yet the sudden removal forced many back to GPT-5.5 or Opus 4.8. The latter scored between the two on several coding metrics, including 69.2% on SWE-Bench Pro.

Recent coverage underscores the split. VentureBeat detailed the full benchmark table and enterprise testimonials showing Fable 5’s edge on knowledge work and tool use. Artificial Analysis confirmed the Intelligence Index lead and 1 million token context window against GPT-5.5’s 922,000. Even after the suspension, third-party leaderboards kept Fable 5 at the top in cached results. A new Agents’ Last Exam benchmark delivered a surprise upset with GPT-5.5 variants taking first and second while a Fable 5-powered entry placed third. That outcome highlighted how harness design and specific agent scaffolds can flip relative performance.

So pricing, safety policy and regulatory posture now shape access as much as capability. GPT-5.5 remains the practical default for many workflows. Its lower cost and uninterrupted availability matter when budgets tighten. Fable 5 demonstrated what becomes possible when models push further on planning depth and codebase comprehension. The temporary removal leaves an open question. If negotiations restore access under adjusted controls, enterprises gain a sharper tool for software engineering and research. Until then developers weigh trade-offs between raw benchmark dominance and reliable, affordable deployment. The numbers favor one model. Current availability points to the other.
