In multimodal learning, CLIP has been recognized as the \textit{de facto} method for learning a shared latent space across multiple modalities, pulling similar representations together and pushing dissimilar ones apart. Although CLIP-based losses effectively align modalities at the semantic level, the resulting latent spaces often remain only partially shared, revealing a structural mismatch known as the modality gap. While the necessity of addressing this phenomenon remains debated, particularly given its limited impact on instance-wise tasks (e.g., retrieval), we show that its influence is strongly pronounced in group-level tasks (e.g., clustering). To support this claim, we introduce a novel method designed to consistently reduce this discrepancy in the two-modal setting, with a straightforward extension to the general $n$-modal case. Through extensive evaluation, we demonstrate a novel insight: while reducing the gap yields only marginal or inconsistent improvements on traditional instance-wise tasks, it significantly improves performance on group-level tasks. These findings may reshape our understanding of the modality gap, highlighting its key role in tasks that require semantic grouping.
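For concreteness, the modality gap is often quantified as the distance between the centroids of the two modalities' embeddings; this formalization is a common convention from the literature rather than a definition fixed by this abstract, so we state it only as a minimal sketch:
\[
\Delta_{\mathrm{gap}} \;=\; \Bigl\lVert \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i \;-\; \frac{1}{N}\sum_{i=1}^{N} \mathbf{y}_i \Bigr\rVert_2,
\]
where $\mathbf{x}_i$ and $\mathbf{y}_i$ denote the $\ell_2$-normalized embeddings of the $i$-th paired sample from each modality, and $N$ is the number of pairs.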