AI Reasoning Models Now Beat Top Doctors on Rare Diagnoses

OpenAI's o1 models outperformed physicians on rare disease diagnosis and ER reasoning in a landmark Science study. Working from text records alone, they reached correct or near-correct answers more often than experienced doctors. Tests on real emergency department records confirm the edge, yet integration challenges remain. Human oversight stays essential.
Written by Eric Hastings

Doctors have spent years chasing elusive answers for patients with baffling symptoms. A new wave of AI systems just shortened that hunt. In head-to-head tests on some of medicine’s toughest puzzles, these models deliver correct or near-correct diagnoses more often than experienced physicians. And they do it in seconds.

The shift comes from OpenAI’s o1 reasoning models. Researchers at Harvard Medical School and Beth Israel Deaconess Medical Center put them through rigorous exams. Results appeared April 30 in Science. The AI outperformed human doctors across multiple benchmarks. It also held up when fed messy real-world records from a Boston emergency department.

Performance That Stands Out

Consider the numbers. On challenging New England Journal of Medicine Clinicopathological Conference cases, the o1-preview model included the correct diagnosis in its list 78.3% of the time. It also landed on the exact or a very close answer more consistently than earlier GPT-4 versions: in one subset of 70 cases, its accuracy hit 88.6% versus GPT-4’s 72.9%. Physicians scored lower.

Real ER data told a similar story. The model reviewed 76 actual cases at three stages: initial triage, first doctor encounter, and hospital admission. At triage, with the least information available, it identified the exact or very close diagnosis 67.1% of the time. Two attending physicians managed 55.3% and 50.0%. By admission, the AI reached 81.6%. The humans lagged behind.

But numbers only tell part of the tale. The AI processed electronic health records alone. No images. No direct patient conversation. Still it spotted patterns doctors missed in the moment. One case involved a patient with a pulmonary embolism who initially improved on treatment yet later worsened. The medical team suspected the medication had failed. The model scanned the full record and flagged a history of lupus that explained heart inflammation. Correct.

Similar gains showed up on other tests. In NEJM Healer diagnostic scenarios, the AI earned perfect clinical reasoning scores in 78 out of 80 instances. Residents and attendings scored far lower. On management decisions for complex vignettes, the model posted a median score of 89%. Physicians using conventional resources like search engines and databases scored 34%.

These results build on earlier signals. A February 2026 study highlighted in Medical Xpress showed DeepRare AI correctly identifying rare diseases on its first try 64.4% of the time. Doctors hit 54.6%. The pattern repeats: AI can now draw on thousands of rare conditions that no single specialist sees often enough to master.

Yet the technology carries sharp edges. Models excel at sequential, step-by-step logic. Human clinicians juggle multiple uncertain possibilities at once. They update beliefs fluidly as new details emerge. AI can latch onto one strong explanation and struggle when facts shift. A TechRadar report noted that even top systems falter when several competing diagnoses must be weighed simultaneously. It quoted Arya Rao of Harvard Medical School: “When we say clinical reasoning, it doesn’t mean the same thing as model reasoning.”

Experts stress the gap. Arjun Manrai, assistant professor of biomedical informatics at Harvard and co-senior author of the Science paper, captured the moment. “We’re witnessing a really profound change in technology that will reshape medicine,” he said. Adam Rodman, hospitalist at Beth Israel and co-senior author, added perspective from the ER test. “This is the big conclusion for me — it works with the messy real-world data of the emergency department. It works for making diagnoses in the real world.”

David Reich, who reviewed the findings, saw both promise and hard questions. “This paper is a beautiful summary of just how much things have improved. You have something which is quite accurate, possibly ready for prime time.” He quickly added the harder part. “Now the open question is how the heck do you introduce it into clinical workflows in ways that actually improve care?” Outcomes in real medicine often prove more subtle than a single diagnostic label.

Recent coverage echoes the urgency. An NPR report from late April detailed how the AI beat two experienced physicians while using only the limited data available at each decision point. News-Medical on May 4 noted the model generally outperformed baselines across tasks, including a perfect R-IDEA reasoning score in most Healer cases. STAT News framed the work as answering a 1959 challenge in Science on what it would take for machines to outperform humans at diagnosis. The bar has been cleared.

Still, deployment remains distant. The Science authors call for prospective trials. Current tests rely on text. Real care involves images, sounds, body language, and chaotic timing. Liability questions loom. Who decides when to trust the model? Who bears responsibility when it errs? Hospitals already experiment with AI for documentation and basic triage. Diagnostic support sits further out. Yet the performance gap on rare cases grows hard to ignore.

Patients with undiagnosed conditions often wait years. Specialists see common diseases daily but encounter true zebras once or twice in a career. AI draws on vastly broader training data. It spots connections across global literature in moments. One transplant patient in the studies showed subtle signs of life-threatening infection. The model raised suspicion a full day before the clinical team acted.

That speed matters. So does consistency. Human performance varies. Fatigue sets in. Cognitive biases creep in. The AI delivers the same high-level analysis at 3 a.m. as at 10 a.m. But it cannot sit with a frightened patient. It cannot weigh a family’s values against statistical odds. It cannot replace the physician’s role in synthesis and trust-building.

So the future likely holds partnership. Doctors augmented by rapid, high-accuracy second opinions. Systems that flag overlooked possibilities early. Teams that combine human empathy with machine precision. The Science study marks a turning point. Benchmarks once thought unreachable by computers now fall routinely. Medicine must adapt.

Researchers continue to test newer versions. Performance keeps climbing. Yet every advance sharpens the same core questions. How will hospitals integrate these tools without introducing new errors? How will training change when trainees lean on AI that already outscores attendings on classic cases? How will payment systems reward collaboration between human and machine?

Answers won’t arrive overnight. But the evidence is becoming clear. On the hardest diagnostic problems, AI reasoning models have crossed a threshold. They don’t just match specialists. In many measured scenarios, they surpass them. The task ahead lies in turning that capability into safer, faster, more equitable care.
