Human perception integrates multiple modalities, such as vision, hearing, and language, into a unified understanding of the surrounding reality. While recent multimodal models have achieved significant progress by aligning pairs of modalities via contrastive learning, their solutions do not scale to multiple modalities. These models typically align each modality to a designated anchor without ensuring the alignment of all modalities with each other, leading to suboptimal performance in tasks requiring a joint understanding of multiple modalities. In this paper, we structurally rethink the conventional pairwise approach to multimodal learning and present the novel Gramian Representation Alignment Measure (GRAM), which overcomes the above-mentioned limitations. GRAM learns and then aligns $n$ modalities directly in the higher-dimensional space in which modality embeddings lie by minimizing the Gramian volume of the $k$-dimensional parallelotope spanned by the modality vectors, ensuring the geometric alignment of all modalities simultaneously. GRAM can replace cosine similarity in any downstream method, holds for 2 to $n$ modalities, and provides more meaningful alignment than previous similarity measures. The novel GRAM-based contrastive loss function enhances the alignment of multimodal models in the higher-dimensional embedding space, leading to new state-of-the-art performance in downstream tasks such as video-audio-text retrieval and audio-video classification. The project page, the code, and the pretrained models are available at https://ispamm.github.io/GRAM/.
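The core quantity the abstract describes, the Gramian volume of the parallelotope spanned by $k$ unit-normalized modality embeddings, can be sketched as follows. This is a minimal illustration of the underlying linear algebra (volume $= \sqrt{\det(A^\top A)}$), not the authors' implementation; function and variable names are illustrative.

```python
import numpy as np

def gram_volume(vectors):
    """Volume of the parallelotope spanned by the given embedding vectors.

    Each vector is unit-normalized, stacked as a column of A (d x k),
    and the volume is sqrt(det(A^T A)), i.e. the Gramian determinant.
    A volume near 0 means the modality embeddings are well aligned
    (nearly collinear); a volume near 1 means they are nearly orthogonal.
    """
    A = np.stack([v / np.linalg.norm(v) for v in vectors], axis=1)
    G = A.T @ A  # k x k Gram matrix of pairwise inner products
    return float(np.sqrt(max(np.linalg.det(G), 0.0)))  # clamp tiny negatives

# Illustrative check: identical embeddings span zero volume,
# orthogonal embeddings span unit volume.
aligned = gram_volume([np.array([1.0, 0.0, 0.0]), np.array([2.0, 0.0, 0.0])])
orthogonal = gram_volume([np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])])
```

Unlike cosine similarity, which is defined only for pairs, this volume is defined for any number of modalities at once, which is why minimizing it aligns all $n$ embeddings jointly rather than each one against a single anchor.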