Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities to improve multimodal understanding. Traditional methods often depend on pairwise contrastive learning, which relies on a predefined anchor modality and thus restricts alignment across all modalities. Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain, such as the limitations imposed by fixed anchor points and the instability that arises from optimizing the product of singular values. To address these challenges, we propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous, anchor-free alignment of multiple modalities in a more stable manner. Specifically, grounded in the theoretical insight that full alignment corresponds to a rank-1 Gram matrix, PMRL optimizes the dominant singular value of the representation matrix to align modalities along a shared leading direction. We propose a softmax-based loss function that treats the singular values as logits to prioritize the largest one. In addition, an instance-wise contrastive regularization on the leading eigenvectors maintains inter-instance separability and prevents representation collapse. Extensive experiments across diverse tasks demonstrate PMRL's superiority over baseline methods. The source code will be publicly available.
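To make the core idea concrete, the following is a minimal PyTorch sketch of the two components described above: a softmax loss over singular values that pushes each instance's stacked modality embeddings toward rank 1, and a contrastive-style repulsion over the leading directions of different instances. This is an illustrative reading of the abstract, not the authors' implementation: the `(K, d)` stacking, the row normalization, the function names, the temperature `tau`, and the use of `|cos|` to absorb the sign ambiguity of singular vectors are all assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_loss(reps: torch.Tensor) -> torch.Tensor:
    """Softmax loss over singular values for one instance (a sketch).

    reps: (K, d) matrix stacking one instance's K modality embeddings.
    Full alignment corresponds to rank(reps) == 1, i.e. every singular
    value except the largest vanishes, so the singular values are
    treated as logits and the largest one's probability is maximized.
    """
    reps = F.normalize(reps, dim=-1)   # row-normalize each embedding
    s = torch.linalg.svdvals(reps)     # singular values, descending order
    return -F.log_softmax(s, dim=0)[0]

def leading_direction(reps: torch.Tensor) -> torch.Tensor:
    """Leading right singular vector: the shared direction in R^d."""
    reps = F.normalize(reps, dim=-1)
    _, _, Vh = torch.linalg.svd(reps, full_matrices=False)
    return Vh[0]

def separability_reg(dirs: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Repulsion term keeping the N instances' leading directions apart.

    dirs: (N, d) leading directions for a batch of N instances.
    abs() absorbs the sign ambiguity of singular vectors.
    """
    n = len(dirs)
    dirs = F.normalize(dirs, dim=-1)
    sims = (dirs @ dirs.T).abs() / tau
    off_diag = sims[~torch.eye(n, dtype=torch.bool)].view(n, n - 1)
    # logsumexp softly penalizes the most similar other instance
    return torch.logsumexp(off_diag, dim=1).mean()
```

In training, one would presumably average `alignment_loss` over the batch and add a weighted regularization term, e.g. `loss = align.mean() + lam * separability_reg(dirs)`; the weight `lam` is likewise a placeholder.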