OpenAI’s Latest Models Hallucinate More Than Previous Ones

OpenAI has a problem with its latest reasoning models, with the company struggling to figure out why they hallucinate more.
Written by Matt Milano

OpenAI o3 and o4-mini are two of the company’s latest AI models, and among its most advanced yet. Unfortunately for OpenAI, while most new AI models improve on the hallucination problem, these two buck that trend, hallucinating more than their predecessors.

In a technical report, OpenAI says part of the explanation may be that the new models make more claims overall, meaning they produce more correct answers as well as more incorrect ones:

We tested OpenAI o3 and o4-mini against PersonQA, an evaluation that aims to elicit hallucinations. PersonQA is a dataset of questions and publicly available facts that measures the model’s accuracy on attempted answers.

We consider two metrics: accuracy (did the model answer the question correctly) and hallucination rate (checking how often the model hallucinated).

The o4-mini model underperforms o1 and o3 on our PersonQA evaluation. This is expected, as smaller models have less world knowledge and tend to hallucinate more. However, we also observed some performance differences comparing o1 and o3. Specifically, o3 tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims. More research is needed to understand the cause of this result.
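
The trade-off the report describes is easy to see with toy numbers. Below is a minimal sketch of how metrics like these could be computed; the exact definitions (accuracy over all questions, hallucination rate over attempted answers) are assumptions for illustration, since the excerpt does not spell out OpenAI’s actual grading pipeline.

# A minimal sketch of PersonQA-style scoring. The metric definitions
# below are assumptions for illustration, not OpenAI's actual pipeline.

def score(n_correct: int, n_wrong: int, n_abstained: int) -> dict[str, float]:
    """Score a QA eval from counts of graded model responses."""
    total = n_correct + n_wrong + n_abstained
    attempted = n_correct + n_wrong
    return {
        # accuracy: share of all questions answered correctly
        "accuracy": n_correct / total,
        # hallucination rate: share of attempted answers that were wrong
        "hallucination_rate": n_wrong / attempted,
    }

# Toy numbers: out of 100 questions, a cautious model abstains often,
# while an eager model answers almost everything.
print(score(n_correct=45, n_wrong=15, n_abstained=40))
# {'accuracy': 0.45, 'hallucination_rate': 0.25}
print(score(n_correct=55, n_wrong=35, n_abstained=10))
# {'accuracy': 0.55, 'hallucination_rate': 0.388...}

Under these assumed definitions, the more eager model scores higher on accuracy and higher on hallucination rate at the same time, which is the same pattern OpenAI reports when comparing o3 to o1.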

A Growing Problem for the Industry

Hallucinations represent a growing problem for the industry, one that has no easy answer and that undermines the billions being invested in AI.

Google CEO Sundar Pichai went on record as early as mid-2023 saying that all AI models hallucinate and that there were no good answers to the problem yet.

Pichai admitted hallucinations are “expected,” saying that “no one in the field has yet solved the hallucination problems. All models do have this as an issue.”

Speaking of efforts to find a solution, Pichai said “it’s a matter of intense debate” and that his team would continue to “make progress.”

Similarly, when asked how confident he was that Apple Intelligence would be free of the issue, Apple CEO Tim Cook said he would never claim AI wouldn’t hallucinate.

“It’s not 100 percent. But I think we have done everything that we know to do, including thinking very deeply about the readiness of the technology in the areas that we’re using it in,” Cook replied. “So I am confident it will be very high quality. But I’d say in all honesty that’s short of 100 percent. I would never claim that it’s 100 percent.”

Analysts Growing Increasingly Concerned

Analysts are growing increasingly concerned about whether generative AI can live up to expectations, with hallucinations featuring prominently in those concerns.

Gartner analyst Arun Chandrasekaran warned that organizations were about to experience the “trough of disillusionment” as a result of unrealistic hype, runaway costs, and seemingly unfixable hallucinations.

“The expectations and hype around GenAI are enormously high,” Chandrasekaran said. “So it’s not that the technology, per se, is bad, but it’s unable to keep up with the high expectations that I think enterprises have because of the enormous hype that’s been created in the market in the last 12 to 18 months.”

“I truly still believe that the long-term impact of GenAI is going to be quite significant, but we may have overestimated, in some sense, what it can do in the near term,” Chandrasekaran added.

Chandrasekaran specifically noted that there is “no robust solution to hallucinations.”

With OpenAI’s latest models hallucinating more, not less, Chandrasekaran’s cautions may prove prescient.
