
OpenAI’s GPT-5 Underperforms in Real-World Tests Despite Hype

OpenAI’s GPT-5, touted for advanced coding and agentic tasks, faces criticism for underperforming in real-world tests, with testers reporting erroneous outputs and hallucinated features where GPT-4o handled similar prompts more reliably. Despite strong benchmark results, developers report frustration, and some prefer its predecessor. GPT-5 will need refinement to meet practical demands.
Written by Corey Blackwell
Monday, August 11, 2025

OpenAI’s Latest Model Faces Scrutiny

In the fast-evolving world of artificial intelligence, OpenAI’s release of GPT-5 has sparked intense debate among developers and tech experts. Billed as the company’s most advanced model yet, GPT-5 promised breakthroughs in coding and agentic tasks, but early tests reveal a more nuanced picture. According to a hands-on review by David Gewirtz in ZDNet, the model’s coding capabilities fell short of expectations, prompting the tester to revert to GPT-4o for practical work. Gewirtz detailed attempts to generate code for tasks like creating a WordPress plugin and handling data visualization, where GPT-5 produced incomplete or erroneous outputs, requiring multiple iterations to fix.

This sentiment echoes broader feedback from the developer community. On platforms like X, users have expressed frustration, with one post noting that building with the GPT-5 API feels “frustrating af” because failures are harder to identify than with models like Anthropic’s Claude 3.5 Sonnet or even GPT-4o. OpenAI’s announcement on its blog highlighted GPT-5’s strengths in instruction-following and on benchmarks like COLLIE and Scale MultiChallenge, yet real-world coding scenarios seem to expose limitations.

Benchmark Wins Versus Practical Shortfalls

Benchmarks paint GPT-5 as a leader, with reports from Vellum.ai showing strong performance in health-related tasks and general metrics. However, when it comes to coding, the model struggles with consistency. In Gewirtz’s ZDNet tests, GPT-5 failed to deliver a functional plugin on the first try, hallucinating features and ignoring specifications, whereas GPT-4o handled similar prompts more reliably. This gap suggests that while GPT-5 excels in controlled evaluations, its application in dynamic coding environments lags.

Comparisons with GPT-4o are particularly telling. A detailed benchmark from Passionfruit pitted GPT-5 against its predecessor, finding marginal improvements in reasoning but underwhelming results in coding efficiency and error rates. Users on X have corroborated this, with posts describing GPT-5’s outputs as “slop with random bolding,” lacking the polish of GPT-4o. Fortune’s coverage notes reduced hallucinations and new “vibe coding” features, yet these innovations don’t fully compensate for the practical deficiencies.

Developer Experiences and Industry Implications

Industry insiders are weighing in on these discrepancies. Simon Willison’s blog describes using GPT-5 as a daily driver, praising its steerability but acknowledging coding hiccups in complex scenarios. Similarly, a thread on Y Combinator’s Hacker News discusses the competitive clustering of AI models, with some researchers skeptical of GPT-5’s purported leaps. Posts on X highlight ongoing challenges, such as GPT-5’s “one shot laziness,” where it underperforms without extensive prompting, in contrast with GPT-4o’s robustness.

The pricing strategy adds another layer. TechCrunch reported that GPT-5’s low costs could ignite a price war, making it attractive despite its flaws. For coders, however, the value lies in reliability. Mashable’s take on vibe coding calls it a “dream come true,” but ZDNet’s tests suggest it’s more aspirational than operational.

Looking Ahead: Refinements and Expectations

OpenAI’s integration with Microsoft, as detailed in Microsoft’s announcement, promises broader access, potentially accelerating fixes through user feedback. Nathan Lambert’s analysis on Interconnects argues that while GPT-5 underdelivers on hype, its foundational advancements are phenomenal. X users, however, remain cautious, with some sticking to GPT-4o amid reports of degradation in earlier models.

Ultimately, GPT-5’s coding prowess may improve with updates, but current reviews indicate it’s not yet the game-changer promised. For industry professionals, this underscores the need for rigorous testing beyond benchmarks, ensuring AI tools align with real-world demands. As competition intensifies, OpenAI faces pressure to refine these capabilities swiftly.

WebProNews is an iEntry Publication
©2025 iEntry, Inc. All rights reserved.