Multimodal representation learning aims to construct a shared embedding space in which heterogeneous modalities are semantically aligned. Despite strong empirical results, InfoNCE-based objectives introduce inherent conflicts that yield distribution gaps across modalities. In this work, we identify two such conflicts in the multimodal regime, both of which are exacerbated as the number of modalities grows: (i) an alignment-uniformity conflict, whereby the repulsive force of the uniformity term undermines pairwise alignment, and (ii) an intra-alignment conflict, whereby jointly aligning multiple modalities induces competing alignment directions. To address these conflicts, we propose a principled decoupling of alignment and uniformity for multimodal representations, yielding a conflict-free recipe for multimodal learning that supports discriminative and generative use cases simultaneously, without task-specific modules. We further provide a theoretical guarantee that our method acts as an efficient proxy for a global Hölder divergence over the modality distributions, and hence reduces the distribution gap among modalities. Extensive experiments on retrieval and UnCLIP-style generation demonstrate consistent gains.
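For concreteness, the two-term view of InfoNCE underlying this discussion can be sketched as follows; this is the standard alignment/uniformity decomposition of Wang and Isola (2020), shown here (with notation $f$, $g$, $\tau$, $t$ introduced for illustration) as a reference point rather than as the exact objective proposed in this work. With encoders $f$ and $g$, positive pairs $(x, y) \sim p_{\mathrm{pos}}$, and temperature $\tau$,
$$
\mathcal{L}_{\mathrm{InfoNCE}} \;=\; \underbrace{-\,\mathbb{E}_{(x,y)\sim p_{\mathrm{pos}}}\!\left[\tfrac{1}{\tau}\, f(x)^{\top} g(y)\right]}_{\text{alignment (attraction)}} \;+\; \underbrace{\mathbb{E}_{x}\!\left[\log \sum_{y'} \exp\!\left(\tfrac{1}{\tau}\, f(x)^{\top} g(y')\right)\right]}_{\text{uniformity (repulsion)}},
$$
whose log-sum-exp term repels all pairs, positives included, which is the source of the alignment-uniformity conflict above. One standard decoupled instantiation of the two effects, stated here for a single pair of modalities, is
$$
\mathcal{L}_{\mathrm{align}} \;=\; \mathbb{E}_{(x,y)\sim p_{\mathrm{pos}}}\, \big\|f(x)-g(y)\big\|_2^2, \qquad
\mathcal{L}_{\mathrm{uniform}} \;=\; \log\, \mathbb{E}_{x,x' \,\overset{\mathrm{i.i.d.}}{\sim}\, p_{\mathrm{data}}}\, e^{-t\,\|f(x)-f(x')\|_2^2},
$$
where $t>0$ is a scale hyperparameter; optimizing the two terms separately removes the coupled repulsion that InfoNCE applies to matched pairs.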