Recent advances in Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across diverse tasks, attracting significant attention from the AI community. However, their performance and reliability in specialized domains such as medicine remain insufficiently assessed. In particular, most existing evaluations concentrate on simple Visual Question Answering (VQA) over multimodal data while overlooking the deeper capabilities of LVLMs. In this study, we introduce RadVUQA, a novel Radiological Visual Understanding and Question Answering benchmark, to comprehensively evaluate existing LVLMs. RadVUQA validates LVLMs across five dimensions: 1) Anatomical understanding, assessing a model's ability to visually identify biological structures; 2) Multimodal comprehension, the capability to interpret linguistic and visual instructions and produce the desired outcome; 3) Quantitative and spatial reasoning, evaluating a model's spatial awareness and its proficiency in combining quantitative analysis with visual and linguistic information; 4) Physiological knowledge, measuring a model's understanding of the functions and mechanisms of organs and organ systems; and 5) Robustness, assessing a model's performance on unharmonised and synthetic data. The results indicate that both generalist and medical-specific LVLMs have critical deficiencies, exhibiting weak multimodal comprehension and quantitative reasoning capabilities. Our findings reveal a large gap between existing LVLMs and clinicians, highlighting the urgent need for more robust and intelligent LVLMs. The code and dataset will be made available upon acceptance of this paper.