Harvard Study in Science: AI Achieves 67% Diagnostic Accuracy in ER, Surpassing Senior Attending Physicians – But It's Too Early to Replace Doctors

Reported by Synced

Edited by Rhino & Solomon

【Synced Insights】A Harvard study published in Science reports that in a double-blind comparison involving 76 real emergency room patients, OpenAI's o1 model achieved a diagnostic accuracy of 67% at triage, outperforming two attending physicians who scored 55% and 50%. The gap in treatment plan scores was even more striking: 89% versus 34%. However, AI cannot yet see a patient's pallor or pain. The real transformation is not that "AI won," but that the ER is moving toward a new three-way paradigm: "doctor × patient × AI."

A bombshell has just dropped into the global medical community.

Harvard Medical School, in collaboration with Beth Israel Deaconess Medical Center, published a disconcerting study in Science.

In a real emergency department triage scenario, OpenAI's o1 reasoning model had a diagnostic accuracy of 67%, while two experienced internal medicine attending physicians scored 55% and 50%, respectively.

AI won.

Not on a test, not in an exam, but in a real-life ER.

Even more sobering data followed—in a test for formulating treatment management plans, o1 scored 89%, while human doctors, using traditional resources, had a median score of just 34%.

The gap is not marginal: o1's score is roughly 2.6 times the physicians' median.

This is not self-promotion by an AI company. This is a result led by Harvard Medical School, endorsed by a top academic journal, and confirmed by a double-blind review.

The study's corresponding author, Arjun Manrai, who heads the AI lab at Harvard Medical School, put it this way: "We tested this AI model with almost all our benchmarks, and it surpassed all previous models and the physician baseline."

And with that, a crack between eras has opened.

76 Real Patients, Zero Pre-Processing, Double-Blind Showdown

The most hardcore part of this research is that it didn't test AI on carefully curated textbook cases. Instead, it threw the rawest, messiest electronic health records (EHRs) from the ER directly at the machine.

The research team randomly selected 76 real patients from the Beth Israel emergency department and compared outputs at three key diagnostic nodes: ER triage (just after the patient's arrival), the ER physician's first assessment, and admission to the hospital or ICU.

At each node, two internal medicine attending physicians and two OpenAI models, o1 and GPT-4o, each provided a differential diagnosis listing up to five possible diagnoses.

A crucial detail: The researchers performed zero pre-processing on the data.

The paper explicitly states that the AI model received information identical to what doctors see in the EHR—messy, incomplete, and noisy real-world clinical data.

Then, two other attending physicians conducted a "blind review"—they didn't know which diagnoses came from a human and which from the AI.

The blind review results showed that the reviewers were almost completely unable to distinguish the source of the diagnoses. One physician guessed the AI/human source correctly only 15.2% of the time (selecting "can't tell" 83.6% of the time), and the other was even more extreme at just 3.1% (selecting "can't tell" 94.4% of the time).

In other words, even senior doctors could not tell which diagnoses had been written by the machine.

Under these stringent conditions, at ER triage, the stage with the least information, the tightest time constraints, and the most critical decisions, the o1 model reached 67.1% accuracy (counting exact or very close diagnoses).

The two human doctors scored 55.3% and 50.0%.

As more information gradually became available, everyone's performance improved: by the admission stage, o1's accuracy rose to 81.6%, while the doctors' were 78.9% and 69.7%.

But a gap always persisted, and it was widest during the initial, most information-poor phase.
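To put these percentages in perspective, here is a rough back-of-the-envelope sketch. It assumes each reported figure is a simple proportion of the 76 cases, so the case counts below are inferred from the percentages rather than taken from the paper. It also uses a standard 95% Wilson score interval, which is an illustrative choice and not necessarily the statistical method the authors used. The names `implied_count` and `wilson_interval` are hypothetical helpers.

```python
import math

N_PATIENTS = 76  # cohort size reported in the article

# Accuracy (%) at triage and at admission, as quoted in the article.
reported = {
    "o1":          {"triage": 67.1, "admission": 81.6},
    "physician_1": {"triage": 55.3, "admission": 78.9},
    "physician_2": {"triage": 50.0, "admission": 69.7},
}


def implied_count(pct, n=N_PATIENTS):
    """Whole number of cases implied by a percentage.

    Assumption: the percentage is a simple proportion of n cases.
    """
    return round(pct / 100 * n)


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


for rater, stages in reported.items():
    for stage, pct in stages.items():
        k = implied_count(pct)
        lo, hi = wilson_interval(k, N_PATIENTS)
        print(f"{rater:12s} {stage:9s} {k}/{N_PATIENTS} = {k / N_PATIENTS:.1%}  "
              f"95% CI {lo:.1%} to {hi:.1%}")
```

With only 76 cases, each interval spans roughly ten percentage points on either side of the point estimate. This simple check treats every rater independently and ignores the fact that all raters scored the same patients, so it should be read as a sense of scale, not as a replacement for the paper's own analysis.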

This is precisely the most frightening discovery—the most critical period in the ER is the "first few minutes." A patient has just been wheeled in, information is fragmented, life and death hang in the balance, and doctors must make judgments under extreme uncertainty.

And it is precisely in this phase that the AI stood out most.

ER Doctors Push Back: What Does Comparing Internists to AI Prove?

After the paper's release, an emergency physician named Kristen Panthagani directly challenged it on social media, calling it an "over-hyped interesting study."

Her core contention: the study compared AI against internal medicine attending physicians, not emergency physicians.

"If we're going to compare the clinical capabilities of AI and a doctor, we should at least use doctors from the same specialty. I'm not surprised that a large language model can beat a dermatologist on a neurosurgery subspecialty exam, but that doesn't really tell us anything."

She also pointed out the fundamental logic of emergency medicine: "As an emergency physician seeing a patient for the first time, my primary goal is not to guess the final diagnosis. My primary goal is to figure out if you have a disease that can kill you."

Is this a powerful rebuttal?

Yes. But it's also worth noting that the research paper itself acknowledges this limitation, and the core argument of the paper was never that "AI can replace ER doctors." Instead, it's that "AI's reasoning capability under limited information has reached a level worthy of clinical trials."

Emergency physicians do much more than "guess the illness" on site—they look at a patient's complexion, listen to their breathing, feel the degree of pain, and judge subtle changes in vital signs.

These subtle non-verbal cues are sometimes more important than any test indicator.

An experienced ER doctor walks into a room, glances at the patient, and might have already made 80% of the judgment—an ability called "clinical gestalt." It comes from tens of thousands of real-life consultations and is something no AI can currently replicate.

Manrai himself admitted that his team is studying AI's ability to process images and other non-textual signals, and they "see rapid progress," but there is still a long way to go before clinical deployment.

The Lesson of Hinton's "Prophecy": Radiologists Didn't Lose Jobs, They Got Busier

Speaking of AI replacing doctors, one classic prediction that backfired must be mentioned.

In 2016, the "Godfather of AI" and Nobel laureate Geoffrey Hinton said something that shook the medical world: People should stop training radiologists now. Deep learning will be better than radiologists in five years; it's just completely obvious.

This statement scared off many medical students who were ready to choose radiology. Throughout the late 2010s, the media was flooded with articles about the "imminent extinction of radiology."

A decade has passed.

The radiology team at the Mayo Clinic has grown 55% since 2016, reaching 400 people. The American College of Radiology predicts a 26% increase in the supply of radiologists over the next 30 years.

The world is now facing its largest-ever shortage of radiologists, not because AI took their jobs, but because AI has made imaging more convenient and, paradoxically, created more demand.

Hinton himself later admitted he had "spoken too broadly."

He revised his prediction: Future medical image interpretation will be done by a "combination of AI and radiologists," with AI making radiologists "vastly more efficient and improving accuracy at the same time."

There's a profound economic principle in this story—the Jevons paradox: When a technology makes the use of a resource more efficient, the total demand for that resource can actually increase dramatically.

Diagnostic imaging became cheaper and faster, so doctors ordered more tests, and radiologists got busier.

The authors of this new Harvard study have clearly learned from Hinton's lesson.

Corresponding author Manrai stated unequivocally at the press conference: "Our findings do not mean AI replaces physicians, though some companies selling AI medical products may say that."

Co-corresponding author Adam Rodman, who heads AI projects at Beth Israel, was even more blunt: "There is currently no formal accountability framework for AI diagnosis. Patients want humans to guide them through life-and-death decisions, to guide them through hard treatment choices."

It's Not That "AI Won," But That Medical Decision-Making Power Is Reorganizing

According to a 2026 survey by the American Medical Association (AMA), over 80% of U.S. physicians already use AI in their profession—double the figure from 2023.

17% of physicians use AI for "diagnostic assistance."

A 2025 Elsevier study found that 20% of clinicians are already seeking "second opinions" from large language models.

This Harvard study suggests that in the information-sparse, high-stakes ER setting, AI's text-based diagnostic reasoning can match or exceed that of the attending physicians it was compared with.

These three data points, overlaid together, point to a clear trend: The power structure of medical decision-making is undergoing a fundamental reorganization.

The old ER model was: patient comes in → doctor judges → decision is made.

The future model might become: patient comes in → AI rapidly scans EHR and gives a preliminary judgment → doctor combines clinical observation and AI suggestions to make a decision → patient participates in discussing the treatment plan.

Study author Rodman predicts a future trifurcation: some tasks humans will consistently do better, some tasks AI will consistently do better, and some tasks will require human-machine collaborative augmentation.

This is what researchers term the "physician-patient-AI" tripartite collaborative model.

It sounds a lot like autonomous driving.

At Level 2, AI assists human decisions; at Level 3, AI takes the lead while a human supervises; at Level 4, the system is fully automatic in specific scenarios.

Currently, AI in medicine sits roughly between Level 2 and Level 3: in the "text world" it can already provide judgments that surpass those of humans, but in real, multimodal clinical scenarios it still needs human eyes, ears, and intuition to fill the gaps.

If AI Misdiagnoses, Who is Responsible?

In all the discussions, there's an elephant in the room no one dares to directly confront: Who bears responsibility if AI makes a mistake?

Rodman admitted frankly in an interview with The Guardian that there is currently no formal accountability framework for AI diagnosis.

If a doctor misdiagnoses, there is a mature medical malpractice system—patients can complain, sue, and doctors face the risk to their license.

But if AI gives a wrong suggestion, the doctor adopts it, and the patient is harmed—is it the doctor's responsibility? The AI company's? The hospital's? Or a shared burden among all three?

An even more complex scenario: If AI gives a correct suggestion, but the doctor overrules the AI's judgment, persists in their own wrong diagnosis, and the patient's treatment is delayed—should the doctor then bear additional liability for "ignoring the AI's advice"?

There is also a more insidious risk: over-reliance.

When doctors get used to high-accuracy judgments from AI, will their independent thinking abilities atrophy? Just as GPS has caused many people to lose their ability to navigate autonomously, will AI-assisted diagnosis cause the "muscle" of clinical reasoning to gradually wither in doctors?

No country currently has clear answers to these questions.

References:

https://www.science.org/doi/10.1126/science.adz4433

https://www.harvardmagazine.com/ai/ai-outperforms-doctors-diagnosis-harvard-study
