This paper presents a comprehensive evaluation of GPT-4V's capabilities across diverse medical imaging tasks, including Radiology Report Generation, Medical Visual Question Answering (VQA), and Visual Grounding. While prior efforts have explored GPT-4V's performance in medical image analysis, to the best of our knowledge, our study represents the first quantitative evaluation on publicly available benchmarks. Our findings highlight GPT-4V's potential in generating descriptive reports for chest X-ray images, particularly when guided by well-structured prompts. At the same time, its performance on the MIMIC-CXR benchmark reveals room for improvement on certain evaluation metrics, such as CIDEr. In Medical VQA, GPT-4V demonstrates proficiency in distinguishing between question types but falls short of the VQA-RAD benchmark in terms of accuracy. Furthermore, our analysis exposes the limitations of conventional evaluation metrics like the BLEU score, and we advocate for the development of more semantically robust assessment methods. In Visual Grounding, GPT-4V shows preliminary promise in recognizing bounding boxes, but its precision is lacking, especially in identifying specific medical organs and signs. Our evaluation underscores the significant potential of GPT-4V in the medical imaging domain, while also emphasizing the need for targeted refinements to fully unlock its capabilities.