Does medical specialization of VLMs enhance discriminative power?: A comprehensive investigation through feature distribution analysis

This study investigates the feature representations produced by publicly available, open-source medical vision-language models (VLMs). Although medical VLMs are expected to capture diagnostically relevant features, their learned representations remain underexplored, and standard evaluations such as classification accuracy do not fully reveal whether they acquire truly discriminative, lesion-specific features. Understanding these representations is crucial for revealing the structure of medical images and for improving downstream tasks in medical image analysis. We therefore examine the feature distributions learned by medical VLMs and evaluate the impact of medical specialization. We analyze the distributions of features extracted by several representative medical VLMs from lesion classification datasets spanning multiple image modalities, and compare them with those of non-medical VLMs to assess the effect of domain-specific medical training. Our experiments show that medical VLMs can extract discriminative features that are effective for medical classification tasks. Moreover, non-medical VLMs that incorporate recent contextual-enrichment improvements, such as LLM2CLIP, produce more refined feature representations. These results imply that enhancing the text encoder is more crucial than training intensively on medical images when developing medical VLMs. Notably, non-medical models are particularly vulnerable to biases introduced by text strings overlaid on images. These findings underscore the need to select models carefully according to the downstream task and to account for potential inference risks arising from background biases such as textual information in images.
