OpenAI's new dataset evaluates how well AI answers medical questions
Date: 2025-05-13
What's the story
OpenAI has launched HealthBench, a comprehensive dataset to assess the performance of AI models in answering health-related questions.
Backed by detailed evaluation tools, this open-source resource is touted as a major step forward for AI applications in healthcare.
HealthBench was developed in partnership with 262 doctors across 60 countries and features 5,000 simulated health conversations.
Steps
How are responses graded?
Each AI response is assessed against a guide designed by doctors, with criteria weighted according to medical judgment.
The responses are scored using GPT-4.1, an advanced language model developed by OpenAI.
Because the grading guides were written by doctors, the scores are meant to reflect a range of medical perspectives from around the world.
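To make the idea of a weighted grading guide concrete, here is a minimal sketch in Python. It is not OpenAI's implementation: the `Criterion` class, the `rubric_score` function, the point values, and the rule that a score is earned points divided by the maximum positive points (clipped to the range 0 to 1) are all assumptions made for illustration. In practice, the `met` flags would come from the grading model's judgment of each criterion.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One line of a hypothetical doctor-written grading guide."""
    description: str
    points: int   # weight; negative values penalize harmful advice (assumed)
    met: bool     # whether the grader judged the response to meet it


def rubric_score(criteria):
    """Return earned points over maximum positive points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Made-up criteria and weights, for illustration only.
example = [
    Criterion("Advises calling emergency services", 10, True),
    Criterion("Advises checking for breathing", 6, True),
    Criterion("Explains how to start CPR if needed", 4, False),
    Criterion("Recommends giving food or drink", -8, False),
]

score = rubric_score(example)
print(f"{score:.0%}")
```

With these invented weights, the response earns 16 of 20 possible points. A response that met a negatively weighted criterion would lose points, and the clipping keeps the final score from dropping below zero.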
Comparison
OpenAI's o3 model outperforms competitors in HealthBench
On HealthBench, OpenAI's o3 reasoning model outperformed its competitors with a score of 60%.
It was followed by Elon Musk's Grok at 54% and Google's Gemini 2.5 Pro at 52%.
The dataset supports responses in 49 languages and covers 26 medical specialties, including neurology and ophthalmology, so it can be used to evaluate AI performance in healthcare across different regions and fields.
Example
A look at how HealthBench works
An example shared by OpenAI shows how the dataset can be used to assess an AI model's response to a medical emergency.
Here, the AI was asked what to do on finding an unresponsive neighbor on the floor. The model recommended calling emergency services, checking for breathing, and making sure the airway is clear.
HealthBench evaluated this response, marking the correct actions and the areas for improvement, and gave it a score of 77% for the case.