GPT-4V's purported strong multimodal abilities have raised interest in using it to automate radiology report writing, but thorough evaluations are lacking. In this work, we perform a systematic evaluation of GPT-4V in generating radiology reports on two chest X-ray report datasets: MIMIC-CXR and IU X-Ray. We attempt to directly generate reports using GPT-4V through different prompting strategies and find that it performs poorly on both lexical metrics and clinical efficacy metrics. To understand the low performance, we decompose the task into two steps: 1) the medical image reasoning step of predicting medical condition labels from images; and 2) the report synthesis step of generating reports from (groundtruth) conditions. We show that GPT-4V's performance in image reasoning is consistently low across different prompts. In fact, the distributions of model-predicted labels remain constant regardless of which groundtruth conditions are present in the image, suggesting that the model is not interpreting chest X-rays meaningfully. Even when given groundtruth conditions for report synthesis, its generated reports are less correct and less natural-sounding than those of a finetuned LLaMA-2. Altogether, our findings cast doubt on the viability of using GPT-4V in a radiology workflow.
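The distribution-invariance check described above can be sketched as follows: stratify images by whether a given groundtruth condition is present, then compare the predicted-label frequency distributions across strata. This is a minimal illustrative sketch with made-up records, not the paper's actual evaluation code; the label names and data structure are assumptions.

```python
from collections import Counter

# Hypothetical per-image records (illustrative only, not from the study):
# "gt" holds groundtruth condition labels, "pred" holds model-predicted labels.
records = [
    {"gt": {"Cardiomegaly"}, "pred": {"Pleural Effusion", "Atelectasis"}},
    {"gt": {"Pneumonia"},    "pred": {"Pleural Effusion", "Atelectasis"}},
    {"gt": set(),            "pred": {"Pleural Effusion"}},
    {"gt": {"Cardiomegaly"}, "pred": {"Pleural Effusion", "Atelectasis"}},
]

def pred_distribution(subset):
    """Relative frequency of each predicted label over a set of images."""
    counts = Counter(lbl for r in subset for lbl in r["pred"])
    total = sum(counts.values())
    return {lbl: c / total for lbl, c in counts.items()}

# Stratify by presence of one groundtruth condition; if the predicted-label
# distributions barely differ across strata, the model's outputs are
# insensitive to the actual image content.
with_cm = [r for r in records if "Cardiomegaly" in r["gt"]]
without_cm = [r for r in records if "Cardiomegaly" not in r["gt"]]
print(pred_distribution(with_cm))
print(pred_distribution(without_cm))
```

In the paper's setting, near-identical distributions across strata are evidence that the model is not conditioning its predictions on the X-ray itself.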