AI Outperforms Doctors in Emergency Room Tasks, New Harvard Study Shows

Researchers say the technology could help physicians with triage, diagnosis.

An advanced AI agent has outperformed human physicians on a series of demanding tests that assess the ability to correctly diagnose patient illnesses in clinical settings, a Harvard-led study found. OpenAI’s “o1 preview,” the company’s first model capable of step-by-step reasoning, proved that it could conduct real-world triage in emergency rooms, recommend appropriate diagnostic tests, and perform case management tasks at a level that matched or surpassed the ability of even well-trained human doctors.

The study, led by Harvard researchers with collaborators at Stanford and published today in Science, suggests an urgent need for controlled trials of the technology, the authors say, to determine how it can be most effectively deployed.

The researchers threw several different complex tests at o1 preview. They asked the large language model (LLM) to arrive at a patient diagnosis and develop a testing plan. They evaluated its skill in clinical reasoning compared to both experts and generalist physicians. And in a live clinical setting, they assessed the LLM’s performance on 76 emergency room cases in a Boston hospital at three stages: initial triage at arrival, first contact with a physician, and upon admission to the medical floor or intensive care unit.

Evaluations were performed by two doctors who did not know whether the ER assessments had been made by the AI model or by two expert attending physicians. Those reviewers found that o1 preview matched or exceeded expert human performance across each stage. The AI was particularly good at making assessments at the initial triage stage, when there was the least information available, the study notes.

In other tests, the AI model proved especially adept at diagnoses involving rare diseases and complex cases. For instance, the model excelled in an evaluation that involved real scenarios from Massachusetts General Hospital that have been published in The New England Journal of Medicine. These cases “are typically very challenging,” said Arjun Manrai, a senior co-author of the study, speaking to reporters on April 29 via Zoom. “They’re… full of either arcane or distracting matter, and… span many different areas of medicine.”

The performance of AI compared to human experts in such cases has “really shocked a lot of folks,” Manrai said.

Thomas Buckley, a doctoral student at Harvard Medical School who worked on the study, added that the results suggest that o1 preview is achieving nearly optimal diagnostic performance on this set of challenging cases, which have been used as benchmarks for assessing the diagnostic ability of computers since 1959.

On tasks involving what doctors refer to as “management reasoning,” from recommendations for antibiotic use to how to approach goals of care, including end-of-life conversations, o1 preview significantly outpaced previous AI models, and also outperformed humans using conventional aids such as UpToDate and Google search, the study found.

“Management reasoning is likely a more complex task than diagnostic reasoning,” explained Peter Brodeur, a clinical fellow at Beth Israel Deaconess Medical Center. “It requires many considerations of not only the objective features of a case, but also subjective factors: what context and situations you’re in, and therefore, it probably doesn’t come as a surprise that a reasoning model performs significantly better at such tasks than humans and ChatGPT-4.”

But Manrai emphasized that the team’s findings do not mean that “AI replaces doctors, despite what some companies [selling AI-based healthcare] are likely to say.”

“I think it does mean that we’re witnessing a really profound change in technology that will reshape medicine,” Manrai said, “and that we need to evaluate this technology now and rigorously conduct prospective clinical trials.”

Manrai also pointed to some important caveats. The study was based entirely on text-based inputs, a domain in which language models excel. But practicing physicians, Manrai said, are evaluating many other forms of information: “They have to listen to the patient, they have to review chest X-ray radiographs, imaging studies, and they have to use lots and lots of other types of data—physiological signals, EKGs, ECGs—in everyday clinical decision making.”

Manrai noted that the team is conducting “parallel studies … looking at the performance of these models on images” and other types of signals, and it is seeing rapidly improving results. Still, he envisions AI models working in partnership with physicians to help them make better decisions. “AI models can get things wrong” and they “can be sycophantic,” he said. But they “are also delivering real value and helping patients and doctors today.”

Adam Rodman, a senior co-author of the study and an assistant professor at Harvard Medical School who leads the school’s task force for integrating AI into the curriculum, said the study definitively shows that AI reasoning models can meet the criteria for making diagnoses at the highest levels of human performance.

The results suggest at least two instances in which such models could be especially useful to physicians, said Rodman, who also directs the AI program at Beth Israel Deaconess Medical Center’s Shapiro Center for Education and Research. One is performing triage in emergency rooms, where patients sometimes present with indeterminate symptoms, often accompanied by a large quantity of messy electronic health record data that is full of random noise.

“You can easily imagine how a system that passively ran over the electronic health record could potentially improve quality if it could try to identify diagnostic errors or missed opportunities for diagnosis before they happened,” said Rodman.

The other use case for AI, Rodman said, “is this idea of a second opinion. We know that doctors getting second opinions from their human colleagues generally improves care.” In 2025—a lifetime ago for this rapidly advancing technology—an Elsevier study found that 20 percent of clinicians were already consulting an LLM for second opinions, a number that has surely grown.

Rodman predicts that there are “going to be a subset of tasks that humans do better, and tasks that AI systems consistently do better, and then tasks in which there’s some degree of augmentation or teaming. And as a researcher, I don’t a priori know what that will be.”

“What I don’t want to happen,” Rodman concluded, involves what he called “AI doctor companies” trying to either cut doctors out of the loop or minimize clinical supervision. “I do not think that these results support that,” Rodman said. “What these results support is a robust and ambitious research agenda to try to figure out how we can use these technologies to make patients’ lives better.”

Manrai, too, argued that nothing will ever replace the power of human contact. “Ultimately,” he said, “I think humans want humans to guide them through life-or-death decisions, to guide them through challenging treatment,” and to discuss decisions that affect “their quality of life or how they play with their kids, and what they can do for work.”

Instead, these advances are about having “much better tools to help us.”

Read more articles by Jonathan Shaw
