Background: General-purpose large language models that process both text and images have not been evaluated on a diverse array of challenging medical cases.

Methods: Using 934 cases from the NEJM Image Challenge published between 2005 and 2023, we evaluated the accuracy of the recently released Generative Pre-trained Transformer 4 with Vision model (GPT-4V) against that of human respondents, both overall and stratified by question difficulty, image type, and skin tone. We further conducted a physician evaluation of GPT-4V on 69 NEJM clinicopathological conferences (CPCs). Analyses were conducted with text alone, images alone, and both text and images.

Results: GPT-4V achieved an overall accuracy of 61% (95% CI, 58 to 64%), compared with 49% (95% CI, 49 to 50%) for human respondents. GPT-4V outperformed humans across all levels of difficulty and disagreement, all skin tones, and all image types; the exception was radiographic images, on which GPT-4V and human respondents performed equivalently. Longer, more informative captions were associated with improved performance for GPT-4V but with similar performance for human respondents. GPT-4V included the correct diagnosis in its differential for 80% (95% CI, 68 to 88%) of CPCs when using text alone, compared with 58% (95% CI, 45 to 70%) when using both images and text.

Conclusions: GPT-4V outperformed human respondents on challenging medical cases and was able to synthesize information from both images and text, but its performance deteriorated when images were added to highly informative text. Overall, our results suggest that multimodal AI models may be useful in medical diagnostic reasoning but that their accuracy may depend heavily on context.