OpenAI's GPT-4 matches doctors in eye assessments
GPT-4 also surpassed junior doctors and trainee ophthalmologists


Apr 19, 2024
06:20 pm

What's the story

A recent study conducted by the School of Clinical Medicine at the University of Cambridge has found that OpenAI's GPT-4 nearly equals the performance of expert ophthalmologists in eye assessments. The research, published in PLOS Digital Health, tested several large language models (LLMs), including GPT-4 and its predecessor GPT-3.5, Google's PaLM 2, and Meta's LLaMA, using a non-publicly accessible textbook used for training ophthalmologists.

Procedure

AI model and medical professionals tested on eye assessment

The study involved administering a test of 87 multiple-choice questions drawn from a textbook used for training ophthalmologists. The test was given to both the large language models (LLMs) and a group of medical professionals comprising five expert ophthalmologists, three trainee ophthalmologists, and two junior doctors from non-specialized fields. Notably, the LLMs were not believed to have encountered these specific questions during training.

Test results

GPT-4 surpasses trainees and junior doctors in test

ChatGPT, running on either GPT-4 or GPT-3.5, was given three attempts to answer each question definitively; if it failed to do so, its response was marked as null. The results were surprising. GPT-4 correctly answered 60 of the 87 questions, significantly surpassing the junior doctors' average of 37 correct answers and marginally exceeding the trainees' average of 59.7. Interestingly, one expert ophthalmologist answered only 56 questions accurately.

Comparative performance

How expert ophthalmologists and other LLMs performed

Despite GPT-4's impressive performance, the group of five expert ophthalmologists averaged 66.4 correct answers, slightly outdoing GPT-4. Among the other large language models (LLMs), Google's PaLM 2 scored 49, while GPT-3.5 scored 42. Meta's LLaMA lagged behind with the lowest score of 28, falling below even the junior doctors. These trials were conducted in mid-2023.
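For context, the raw scores above can be converted into accuracy percentages. A quick sketch using the figures reported in the article (the grouping labels are ours, not the study's):

```python
# Scores out of 87 multiple-choice questions, as reported in the article.
TOTAL_QUESTIONS = 87

scores = {
    "Expert ophthalmologists (avg)": 66.4,
    "GPT-4": 60,
    "Trainee ophthalmologists (avg)": 59.7,
    "PaLM 2": 49,
    "GPT-3.5": 42,
    "Junior doctors (avg)": 37,
    "LLaMA": 28,
}

# Print each participant's score as a percentage, highest first.
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    pct = 100 * score / TOTAL_QUESTIONS
    print(f"{name}: {score}/{TOTAL_QUESTIONS} ({pct:.1f}%)")
```

On these numbers, GPT-4's 60/87 works out to roughly 69% accuracy, a few points behind the expert average of about 76%.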

Potential pitfalls

Researchers highlight risks and concerns with LLMs

Despite the promising results, the researchers pointed out several risks and concerns associated with large language models (LLMs). The study was limited by its small number of questions, especially in certain categories, so real-world results might differ. LLMs also have a tendency to "hallucinate," or fabricate information, which could lead to false claims about conditions like cataracts or cancer. Furthermore, these systems often lack nuance, creating additional opportunities for inaccuracies.