Vision-Language Models (VLMs) exhibit a characteristic "cone effect" in which nonlinear encoders map embeddings into highly concentrated regions of the representation space, contributing to the cross-modal separation known as the modality gap. While this phenomenon has been widely observed, its practical impact on supervised multimodal learning, particularly in medical domains, remains unclear. In this work, we introduce a lightweight post-hoc mechanism that keeps pretrained VLM encoders frozen while continuously controlling cross-modal separation through a single hyperparameter λ. This enables systematic analysis of how the modality gap affects downstream multimodal performance without expensive retraining. We evaluate generalist (CLIP, SigLIP) and medically specialized (BioMedCLIP, MedSigLIP) models across diverse medical and natural datasets in a supervised multimodal setting. Results consistently show that reducing an excessive modality gap improves downstream performance, with medical datasets exhibiting stronger sensitivity to gap modulation; however, fully collapsing the gap is not always optimal, and an intermediate, task-dependent separation yields the best results. These findings position the modality gap as a tunable property of multimodal representations rather than a quantity that should be universally minimized.
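The abstract does not spell out how λ acts on the embeddings, so the following is only an illustrative sketch of one way a post-hoc, single-parameter gap control could look, not the authors' method. It assumes the modality gap is measured as the distance between the image and text embedding centroids, and that λ linearly shifts each modality toward the other along that centroid difference (λ = 0 leaves the frozen embeddings unchanged, λ = 1 aligns the centroids). The function names `centroid_gap` and `shift_gap`, the synthetic embeddings, and the optional renormalization step are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm, as is typical for CLIP-style embeddings."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def centroid_gap(img, txt):
    """Assumed gap measure: Euclidean distance between the modality centroids."""
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))


def shift_gap(img, txt, lam, renormalize=False):
    """Shift both modalities along the centroid difference (hypothetical scheme).

    lam = 0 returns the embeddings unchanged; lam = 1 makes the two centroids
    coincide. The encoders are never touched: this operates on their outputs.
    """
    delta = img.mean(axis=0) - txt.mean(axis=0)
    img_shifted = img - 0.5 * lam * delta
    txt_shifted = txt + 0.5 * lam * delta
    if renormalize:
        # Optional: project back to the unit sphere. After this step the
        # centroid gap is no longer exactly (1 - lam) times the original gap.
        img_shifted = l2_normalize(img_shifted)
        txt_shifted = l2_normalize(txt_shifted)
    return img_shifted, txt_shifted


# Synthetic stand-ins for frozen-encoder outputs, offset to mimic two "cones".
rng = np.random.default_rng(0)
img = l2_normalize(rng.normal(size=(256, 32)) + 3.0)
txt = l2_normalize(rng.normal(size=(256, 32)) - 3.0)

base_gap = centroid_gap(img, txt)
for lam in (0.0, 0.5, 1.0):
    img_s, txt_s = shift_gap(img, txt, lam)
    print(f"lambda={lam:.1f}  centroid gap={centroid_gap(img_s, txt_s):.4f}")
```

Without renormalization, the centroid gap under this scheme scales as (1 − λ) times the original gap, which makes it straightforward to sweep λ and train a downstream classifier at each setting to look for a task-dependent optimum.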