In this paper, we critically evaluate the capabilities of the state-of-the-art multimodal large language model GPT-4 with Vision (GPT-4V) on the Visual Question Answering (VQA) task. Our experiments assess GPT-4V's proficiency in answering questions paired with images, using pathology and radiology datasets that span 11 modalities (e.g., microscopy, dermoscopy, X-ray, and CT) and 15 objects of interest (e.g., brain, liver, and lung). The datasets cover a broad range of medical inquiries, comprising 16 distinct question types. Throughout our evaluations, we devised textual prompts that direct GPT-4V to combine visual and textual information. Measured by accuracy, our results indicate that the current version of GPT-4V is not recommended for real-world diagnostics, because its accuracy in answering diagnostic medical questions is unreliable and suboptimal. In addition, we delineate seven unique facets of GPT-4V's behavior in medical VQA, highlighting its limitations in this complex domain. The complete details of our evaluation cases are available at https://github.com/ZhilingYan/GPT4V-Medical-Report.