This paper presents a comprehensive evaluation of GPT-4V's capabilities across diverse medical imaging tasks, including Radiology Report Generation, Medical Visual Question Answering (VQA), and Visual Grounding. Although prior work has explored GPT-4V's performance in medical imaging, to the best of our knowledge, this study is the first quantitative evaluation on publicly available benchmarks. Our findings highlight GPT-4V's potential for generating descriptive reports for chest X-ray images, particularly when guided by well-structured prompts. However, its performance on the MIMIC-CXR benchmark reveals room for improvement on certain evaluation metrics, such as CIDEr. In Medical VQA, GPT-4V distinguishes between question types proficiently but falls short of prevailing benchmarks in accuracy. Furthermore, our analysis exposes the limitations of conventional evaluation metrics such as the BLEU score and motivates the development of more semantically robust assessment methods. In Visual Grounding, GPT-4V shows preliminary promise in recognizing bounding boxes, but its precision is lacking, especially when identifying specific medical organs and signs. Overall, our evaluation underscores the significant potential of GPT-4V in the medical imaging domain while emphasizing the need for targeted refinements to fully unlock its capabilities.
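The limitation of n-gram metrics such as BLEU noted in the abstract can be illustrated with a small sketch. The code below is a simplified, sentence-level BLEU variant (clipped unigram and bigram precision with a brevity penalty), written for illustration only; it is not the evaluation code used in this study, and the function name `simple_bleu` and the three example sentences are hypothetical, not drawn from MIMIC-CXR or any VQA benchmark.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference (clipped counts)."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped precisions
    for n = 1..max_n, multiplied by the brevity penalty. Illustrative only."""
    c = candidate.lower().split()
    r = reference.lower().split()
    precisions = [clipped_precision(c, r, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    brevity_penalty = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / len(c))
    return brevity_penalty * math.exp(log_mean)


# Hypothetical sentences, chosen only to illustrate the effect.
reference = "no pleural effusion is seen"
paraphrase = "there is no evidence of pleural effusion"  # same clinical meaning
contradiction = "pleural effusion is seen"               # opposite clinical meaning

print("paraphrase:   ", round(simple_bleu(paraphrase, reference), 3))
print("contradiction:", round(simple_bleu(contradiction, reference), 3))
```

Under these assumptions, the sentence that contradicts the reference scores higher than the faithful paraphrase, because it shares more surface n-grams with the reference. This is the kind of mismatch between lexical overlap and clinical meaning that motivates more semantically robust assessment methods.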