Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities to improve multimodal understanding. Traditional methods often depend on pairwise contrastive learning, which relies on a predefined anchor modality and thus restricts alignment across all modalities. Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain, such as the limitations imposed by fixed anchor points and the instability arising from optimizing the product of singular values. To address these challenges, in this paper, we propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities, without anchor dependency, in a more stable manner. Specifically, grounded in the theoretical insight that full alignment corresponds to a rank-1 Gram matrix, PMRL optimizes the dominant singular value of the representation matrix to align modalities along a shared leading direction. We propose a softmax-based loss function that treats singular values as logits to prioritize the largest singular value. In addition, instance-wise contrastive regularization on the leading eigenvectors maintains inter-instance separability and prevents representation collapse. Extensive experiments across diverse tasks demonstrate PMRL's superiority over baseline methods. Source code is available at https://github.com/Xiaohao-Liu/PMRL.
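The core idea of the alignment objective described above can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the paper's implementation: it stacks one normalized representation per modality into a matrix, treats its singular values as logits, and takes the negative log-softmax of the largest one, so that minimizing the loss pushes the matrix toward rank 1 (i.e., full alignment along a shared leading direction). The function name, the temperature parameter `tau`, and the input layout are assumptions for illustration.

```python
import numpy as np

def pmrl_alignment_loss(Z, tau=1.0):
    """Hypothetical sketch of a softmax-over-singular-values alignment loss.

    Z   : (M, d) matrix stacking one L2-normalized representation per modality.
    tau : assumed temperature scaling the singular-value "logits".

    Singular values of Z are treated as logits; the loss is the negative
    log-softmax probability of the largest one. It is minimized when the
    leading singular value dominates, i.e., Z is (near) rank 1.
    """
    s = np.linalg.svd(Z, compute_uv=False)  # singular values, descending order
    logits = s / tau
    # negative log-softmax of the leading singular value
    return -(logits[0] - np.log(np.sum(np.exp(logits))))

# Fully aligned modalities (identical rows, rank 1) yield a lower loss
# than mutually orthogonal ones, matching the rank-1 Gram matrix insight.
aligned = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (3, 1))  # 3 modalities, d=4
orthogonal = np.eye(3, 4)
assert pmrl_alignment_loss(aligned) < pmrl_alignment_loss(orthogonal)
```

In the full method, this objective would be complemented by the instance-wise contrastive regularization on leading eigenvectors mentioned above, which keeps different instances separable and prevents collapse to a single shared direction for all data points.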