Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A key yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis showing why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals shortcomings of conventional paradigms in learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. These findings motivate an approach that adaptively learns from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components during fine-tuning to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting holistically to sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning. The code is available at https://github.com/GeWu-Lab/DMIL.
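To make the distinction between redundant and synergistic information concrete, the following toy sketch may help. It is not part of DMIL and not the decomposition used in this work; it only computes plain mutual information on two hand-built discrete datasets. The helper names `mutual_information` and `interaction_profile` are hypothetical, and all outcomes are assumed to be equally likely.

```python
import math
from collections import Counter
from itertools import product


def mutual_information(pairs):
    """Return I(A;B) in bits for a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    marg_a = Counter(a for a, _ in pairs)
    marg_b = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((marg_a[a] / n) * (marg_b[b] / n)))
        for (a, b), c in joint.items()
    )


def interaction_profile(samples):
    """samples: list of (x1, x2, y). Return (I(X1;Y), I(X2;Y), I(X1,X2;Y))."""
    i1 = mutual_information([(x1, y) for x1, _, y in samples])
    i2 = mutual_information([(x2, y) for _, x2, y in samples])
    i12 = mutual_information([((x1, x2), y) for x1, x2, y in samples])
    return i1, i2, i12


# Synergistic toy case (assumption): the label is the XOR of two binary modalities.
xor_samples = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]

# Redundant toy case (assumption): both modalities are copies of the label.
copy_samples = [(y, y, y) for y in [0, 1]]

xor_profile = interaction_profile(xor_samples)
copy_profile = interaction_profile(copy_samples)

print("XOR  (synergy):    I1=%.1f  I2=%.1f  I12=%.1f" % xor_profile)
print("copy (redundancy): I1=%.1f  I2=%.1f  I12=%.1f" % copy_profile)
```

In the XOR case each modality alone carries zero bits about the label while the pair carries one bit, so a model that combines independent unimodal predictions has nothing to work with. In the copy case either modality alone already carries the full bit, so the information is redundant across modalities.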