In multimodal learning, CLIP has emerged as the de facto approach for mapping different modalities into a shared latent space, bringing semantically similar representations closer while pushing dissimilar ones apart. However, CLIP-based contrastive losses exhibit unintended behaviors that hinder true semantic alignment, producing sparse and fragmented latent spaces. This phenomenon, known as the modality gap, has been partially mitigated for standard image-text pairs but remains largely unexplored and unresolved in more complex multimodal settings, such as the medical domain. In this work, we study the phenomenon in the medical setting, showing that the modality gap also arises in medical image-text alignment, and we propose a modality-agnostic framework that closes this gap, ensuring that semantically related representations are better aligned regardless of their source modality. Our method improves alignment between radiology images and clinical text, yielding gains in cross-modal retrieval and image captioning.
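For context, a minimal sketch of the symmetric contrastive (InfoNCE-style) objective that CLIP-like models optimize over a batch of $N$ image-text pairs, with image and text embeddings $v_i$, $t_i$, cosine similarity $\mathrm{sim}(\cdot,\cdot)$, and temperature $\tau$; this is the standard CLIP formulation, not necessarily the exact loss used by the framework proposed in this work:
\[
\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp\!\big(\mathrm{sim}(v_i, t_i)/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(\mathrm{sim}(v_i, t_j)/\tau\big)} + \log\frac{\exp\!\big(\mathrm{sim}(t_i, v_i)/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(\mathrm{sim}(t_i, v_j)/\tau\big)}\right]
\]
Because only paired items are pulled together while all other in-batch pairs are pushed apart, this objective can leave image and text embeddings clustered in separate regions of the shared space, which is the modality gap discussed above.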