Multimodal learning plays a crucial role in enabling machine learning models to fuse and utilize diverse data sources, such as text, images, and audio, to support a variety of downstream tasks. A unified representation across modalities is particularly important for improving efficiency and performance. Recent binding methods, such as ImageBind (Girdhar et al., 2023), typically use a fixed anchor modality and align all other modalities to it in the anchor's embedding space. In this paper, we mathematically analyze fixed-anchor binding methods and uncover notable limitations: (1) over-reliance on the choice of the anchor modality, (2) failure to capture intra-modal information, and (3) failure to account for inter-modal correlation among non-anchored modalities. To address these limitations, we propose CentroBind, a simple yet powerful approach that eliminates the need for a fixed anchor; instead, it employs dynamically adjustable centroid-based anchors generated from all available modalities, resulting in a balanced and rich representation space. We theoretically show that our method captures three crucial properties of multimodal learning: intra-modal learning, inter-modal learning, and multimodal alignment, while also constructing a robust unified representation across all modalities. Our experiments on both synthetic and real-world datasets demonstrate the superiority of the proposed method, showing that dynamic-anchor methods outperform all fixed-anchor binding methods by capturing more nuanced multimodal interactions.
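Since the abstract only sketches the mechanism, the following is a minimal, hypothetical illustration of a centroid-based dynamic anchor: each sample's anchor is the re-normalized mean of its per-modality embeddings, and every modality is aligned to that anchor. It assumes L2-normalized embeddings and an InfoNCE-style contrastive objective (common in binding methods such as ImageBind, but not specified here); the function name `centroid_anchor_loss`, the temperature, and all shapes are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def centroid_anchor_loss(embeddings, temperature=0.07):
    """Sketch of a centroid-anchored alignment objective.

    embeddings: list of (batch, dim) tensors, one per modality,
    assumed to be L2-normalized outputs of per-modality encoders.
    """
    # Dynamic anchor: per-sample centroid of all modality embeddings,
    # re-normalized back onto the unit hypersphere.
    anchor = F.normalize(torch.stack(embeddings, dim=0).mean(dim=0), dim=-1)

    loss = 0.0
    for z in embeddings:
        # InfoNCE between each modality and the centroid anchor:
        # (sample i, anchor i) pairs are positives, all others negatives.
        logits = z @ anchor.t() / temperature
        labels = torch.arange(z.size(0), device=z.device)
        # Symmetric cross-entropy over rows and columns of the logit matrix.
        loss = loss + 0.5 * (F.cross_entropy(logits, labels)
                             + F.cross_entropy(logits.t(), labels))
    return loss / len(embeddings)

# Toy usage: three modalities, batch of 8, 128-dim embeddings.
z_text, z_image, z_audio = (
    F.normalize(torch.randn(8, 128, requires_grad=True), dim=-1)
    for _ in range(3)
)
loss = centroid_anchor_loss([z_text, z_image, z_audio])
loss.backward()  # gradients flow into every modality's encoder, not just one
```

Note how, unlike a fixed-anchor scheme, the anchor here depends on all modalities, so no single modality's embedding space dominates the alignment.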