Are Practicing Physicians Falling Behind AI Models? OpenAI Releases Open-Source Medical Benchmark HealthBench, with o3 Scoring Highest

OpenAI has released HealthBench, an open-source benchmark designed to better measure the capabilities of AI systems in healthcare.

HealthBench was created in collaboration with 262 physicians who have practiced across 60 countries, and it contains 5,000 realistic health conversations. Unlike earlier, narrower benchmarks, HealthBench supports meaningful open-ended evaluation: model responses are graded against 48,562 unique physician-written scoring criteria that cover multiple health contexts (e.g., emergencies, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication).
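To make the idea of grading against scoring criteria concrete, here is a minimal sketch of rubric-based scoring. It is an illustration under stated assumptions, not the code in the simple-evals repository: the `Criterion` class, the function names, and the example point values are all hypothetical, and the scoring rule (points earned on met criteria divided by the maximum achievable positive points, with the benchmark-level mean clipped to the range 0 to 1) is an assumed scheme that should be checked against the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, for illustration only)."""
    description: str
    points: int  # positive values reward a behavior, negative values penalize it


def rubric_score(criteria: Sequence[Criterion], met: Sequence[bool]) -> float:
    """Score one response: earned points over the maximum positive points.

    Assumed rule: a met criterion contributes its points (which may be
    negative); the denominator counts only positive-point criteria.
    """
    if len(criteria) != len(met):
        raise ValueError("criteria and met must have the same length")
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive-point criterion")
    earned = sum(c.points for c, is_met in zip(criteria, met) if is_met)
    return earned / max_points


def benchmark_score(per_example_scores: Sequence[float]) -> float:
    """Average per-example scores and clip the mean to [0, 1] (assumed rule)."""
    if not per_example_scores:
        raise ValueError("need at least one example score")
    mean = sum(per_example_scores) / len(per_example_scores)
    return min(1.0, max(0.0, mean))


# Hypothetical rubric for a single conversation.
example_rubric = [
    Criterion("Advises seeking emergency care for red-flag symptoms", 5),
    Criterion("Asks about relevant medical history", 3),
    Criterion("States an incorrect medication dose", -2),
]
```

In this sketch, a response that meets the first and third criteria earns 5 - 2 = 3 of a possible 8 points. Which criteria a response meets would be judged by a grader, such as a physician or a model-based grader; that step is outside the scope of this example.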

Blog:

https://openai.com/index/healthbench/

Paper:

https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf

Code:

https://github.com/openai/simple-evals

How OpenAI's own models performed:

o3 performs best overall, scoring over 60%

This evaluation particularly focused on

