OpenAI’s ChatGPT Agent Struggles in ZDNet Tests: Only 1 in 8 Tasks Succeeds Due to Hallucinations and Errors

OpenAI's ChatGPT Agent promises autonomous web browsing, data analysis, and task execution. However, ZDNet's tests reveal severe unreliability: only one of eight tasks succeeded, with frequent hallucinations, errors, and fabrications. While the tool shows potential for automating mundane work, human oversight remains essential to avoid misleading outputs and eroded trust.
Written by Ryan Gibson

In the constantly changing world of artificial intelligence, OpenAI’s latest offering, ChatGPT Agent, promised to revolutionize productivity by acting as an autonomous digital assistant capable of browsing the web, analyzing data, and executing complex tasks. Launched on July 17, 2025, the tool combines conversational prowess with action-oriented capabilities, drawing from predecessors like Operator and Deep Research. Yet, early tests reveal a stark gap between hype and reality, with the agent often stumbling on accuracy and reliability.

According to a detailed examination published in ZDNet, a tester conducted eight rigorous evaluations of ChatGPT Agent, yielding just one near-perfect outcome amid a sea of errors and fabrications. The tests spanned tasks like data analysis, content creation, and real-time information retrieval, exposing the agent’s tendency to generate “alternative facts”—plausible but incorrect information that could mislead users in professional settings.

The Perils of Over-Reliance on AI Agents

One standout failure involved instructing the agent to compile a report on recent tech trends. Instead of accurately sourcing data, it invented statistics and cited nonexistent studies, echoing concerns raised in posts on X where users lamented similar hallucinations in AI outputs. ZDNet’s analysis highlighted that the agent excelled at a single task, summarizing a straightforward document without external browsing, but faltered dramatically whenever web interaction was required.

Reviews on Medium, including pieces by The PyCoach and Govindhasamy, praise the agent’s potential for automating mundane work like web browsing and PowerPoint creation. However, these sources also note execution flaws: one reviewer described a scenario in which the agent browsed sites inefficiently, taking far longer than competitors like Genspark, a complaint echoed in recent X discussions criticizing its speed.

Unpacking the Test Failures and Hallucinations

In ZDNet’s battery of tests, seven out of eight attempts produced unreliable results, including a budgeting exercise where the agent miscalculated figures by ignoring key variables. This aligns with broader AI testing methodologies outlined in another ZDNet piece on how the publication evaluates models in 2025, emphasizing benchmarks for factual accuracy and task completion.

X users, including those sharing real-time experiences, have amplified these issues, with posts decrying ChatGPT Agent’s subpar user experience and overhyped features like presentation generation, which one described as “the worst” in comparative trials. A Medium post by Christie C. lists top features but warns of inconsistencies, such as the agent fabricating data during research tasks, underscoring the need for human oversight.

Industry Implications and Future Directions

The implications for businesses are profound: while ChatGPT Agent can handle simple queries conversationally—as noted in OpenAI’s original introduction of ChatGPT in 2022—it struggles with the autonomy required for agentic roles. ZDNet’s findings suggest that without improvements in reasoning and fact-checking, such tools risk eroding trust, much like earlier critiques from X influencers who pointed out ChatGPT’s failures in basic counting or reasoning tasks.

Comparisons with alternatives, like those in ZDNet’s roundup of top AI chatbots, position rivals such as Claude or Gemini as stronger in reliability. Industry insiders must weigh these limitations against the agent’s strengths in creative ideation, as highlighted in a ClickUp blog on building custom AI agents with ChatGPT.

Path Forward: Balancing Innovation with Caution

Ultimately, ChatGPT Agent represents a bold step toward AI autonomy, but its current iteration underscores persistent challenges in the field. As Medium contributors and X discussions indicate, users are excited yet frustrated, often resorting to explicit instructions to mitigate errors—like one X post detailing a failed transcript analysis that improved only with detailed prompts.

For tech leaders, the lesson is clear: integrate such agents judiciously, verifying outputs to avoid the pitfalls of “alternative facts.” With ongoing advancements, as promised in OpenAI’s roadmap, future versions may bridge these gaps, but for now, human assistants remain irreplaceable. ZDNet’s in-depth testing serves as a crucial benchmark, urging the industry to prioritize accuracy over spectacle in the AI arms race.
