The rise of multimodal large language models (MLLMs) has sparked an unprecedented wave of applications in medical imaging analysis. However, medical image classification, one of the earliest and most fundamental tasks integrated into this paradigm, reveals a sobering reality: state-of-the-art medical MLLMs consistently underperform traditional deep learning models, despite their overwhelming advantages in pre-training data and model parameters. This paradox prompts a critical rethinking: where exactly does the performance degradation originate? In this paper, we conduct extensive experiments on 14 open-source medical MLLMs across three representative image classification datasets. Moving beyond superficial performance benchmarking, we employ feature probing to track the information flow of visual features module-by-module and layer-by-layer throughout the entire MLLM pipeline, enabling explicit visualization of where and how classification signals are distorted, diluted, or overridden. As the first attempt to dissect classification performance degradation in medical MLLMs, our findings reveal four failure modes: 1) quality limitation in visual representation, 2) fidelity loss in connector projection, 3) comprehension deficit in LLM reasoning, and 4) misalignment of semantic mapping. Meanwhile, we introduce quantitative scores that characterize the healthiness of feature evolution, enabling principled comparisons across diverse MLLMs and datasets. Furthermore, we provide insightful discussions centered on the critical barriers that prevent current medical MLLMs from fulfilling their promised clinical potential. We hope that our work provokes rethinking within the community, highlighting that the road from high expectations to clinically deployable MLLMs remains long and winding.
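The feature-probing protocol mentioned above can be illustrated with a minimal sketch: a linear probe is trained on frozen features extracted at a given module or layer, and its held-out accuracy measures how much classification signal survives at that point. The snippet below is a hypothetical, self-contained toy version (the `probe_accuracy` helper and the synthetic "layers" are illustrative assumptions, not the paper's actual pipeline or models):

```python
import numpy as np

def probe_accuracy(features, labels, train_frac=0.7, seed=0):
    """Fit a least-squares linear probe on frozen features; return held-out accuracy.

    NOTE: illustrative stand-in for the probing described in the abstract;
    a real study would probe actual MLLM activations, not synthetic arrays.
    """
    rng = np.random.default_rng(seed)
    n = len(labels)
    idx = rng.permutation(n)
    n_tr = int(train_frac * n)
    tr, te = idx[:n_tr], idx[n_tr:]
    X = np.hstack([features, np.ones((n, 1))])  # append a bias column
    y = 2.0 * labels - 1.0                      # map {0,1} labels to {-1,+1}
    w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
    preds = np.sign(X[te] @ w)
    return float(np.mean(preds == y[te]))

# Synthetic stand-in for two stages of a pipeline: an early "layer" where
# class signal is strong, and a later one where it has been diluted.
rng = np.random.default_rng(0)
n, d = 200, 32
labels = rng.integers(0, 2, size=n)
signal = labels[:, None] * 2.0
layer_early = signal + rng.normal(size=(n, d))         # strong class signal
layer_late = 0.1 * signal + rng.normal(size=(n, d))    # signal mostly lost

acc_early = probe_accuracy(layer_early, labels)
acc_late = probe_accuracy(layer_late, labels)
print(f"probe accuracy, early: {acc_early:.2f}, late: {acc_late:.2f}")
```

Comparing probe accuracy across stages in this way localizes where the signal degrades: a drop between two consecutive modules (e.g., before vs. after the connector projection) attributes the loss to that module.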