Multimodal learning plays a pivotal role in advancing artificial intelligence systems by incorporating information from multiple modalities to build a more comprehensive representation. Despite its importance, current state-of-the-art models still suffer from severe limitations that prevent the successful development of a fully multimodal model. In particular, such methods may provide no indication that all the involved modalities are effectively aligned. As a result, some modalities may remain misaligned, undermining the effectiveness of the model in downstream tasks where multiple modalities should contribute additional information that the model fails to exploit. In this paper, we present TRIANGLE: TRI-modAl Neural Geometric LEarning, a novel similarity measure computed directly in the higher-dimensional space spanned by the modality embeddings. TRIANGLE improves the joint alignment of three modalities via a triangle-area similarity, avoiding additional fusion layers or pairwise similarities. When incorporated into contrastive losses in place of cosine similarity, TRIANGLE significantly boosts the performance of multimodal modeling while yielding interpretable alignment rationales. Extensive evaluation on tri-modal tasks such as video-text and audio-text retrieval and audio-video classification demonstrates that TRIANGLE achieves state-of-the-art results across different datasets, improving on cosine-based methods by up to 9 points of Recall@1.
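To make the geometric idea concrete, the following is a minimal sketch of one way a triangle-area similarity over three modality embeddings could be computed. It uses the Gram-determinant formula, which generalizes the cross product to arbitrary dimension; the convention that a well-aligned triplet spans a small triangle (so similarity is the negated area), the L2 normalization, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def triangle_area(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Area of the triangle with vertices a, b, c in R^d.

    For edge vectors u = b - a and v = c - a, the area is
    0.5 * sqrt(det(G)) with G the 2x2 Gram matrix of (u, v).
    This works in any dimension, unlike the 3-D cross product.
    """
    u, v = b - a, c - a
    gram = np.array([[u @ u, u @ v],
                     [v @ u, v @ v]])
    # Clamp tiny negative determinants caused by floating-point error.
    return 0.5 * float(np.sqrt(max(np.linalg.det(gram), 0.0)))

def tri_similarity(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Toy tri-modal similarity (hypothetical): embeddings are
    L2-normalized, and aligned triplets span a smaller triangle,
    hence receive a higher (less negative) score."""
    a, b, c = (x / np.linalg.norm(x) for x in (a, b, c))
    return -triangle_area(a, b, c)

# Aligned triplet (near-identical embeddings) vs. a misaligned one
# (mutually orthogonal embeddings): the aligned score is higher.
e1, e2, e3 = np.eye(3)
aligned = tri_similarity(e1, e1 + 1e-3 * e2, e1 + 1e-3 * e3)
misaligned = tri_similarity(e1, e2, e3)
```

In a contrastive loss, a score of this form could replace the cosine similarity of a matched pair, scoring a matched (video, audio, text) triplet against triplets with one modality swapped in from a negative sample.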