InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions

In multimodal representation learning, synergistic interactions between modalities not only provide complementary information but also create unique outcomes through specific interaction patterns that no single modality could achieve alone. Existing methods may struggle to effectively capture the full spectrum of synergistic information, leading to suboptimal performance in tasks where such interactions are critical. This is particularly problematic because synergistic information constitutes the fundamental value proposition of multimodal representation. To address this challenge, we introduce InfMasking, a contrastive synergistic information extraction method designed to enhance synergistic information through an Infinite Masking strategy. InfMasking stochastically occludes most features from each modality during fusion, preserving only partial information to create representations with varied synergistic patterns. Unmasked fused representations are then aligned with masked ones through mutual information maximization to encode comprehensive synergistic information. This infinite masking strategy enables capturing richer interactions by exposing the model to diverse partial modality combinations during training. As computing mutual information estimates with infinite masking is computationally prohibitive, we derive an InfMasking loss to approximate this calculation. Through controlled experiments, we demonstrate that InfMasking effectively enhances synergistic information between modalities. In evaluations on large-scale real-world datasets, InfMasking achieves state-of-the-art performance across seven benchmarks. Code is released at https://github.com/brightest66/InfMasking.
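The masking-and-alignment idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: it averages an InfoNCE loss (a standard lower-bound estimator of mutual information) over a small, finite number of random masks, whereas the paper derives a dedicated InfMasking loss to approximate the infinite-mask case. The fusion function, the keep ratio of 0.2, the temperature, and every function name here are hypothetical choices made only for this example.

```python
import numpy as np

def random_mask(x, keep_ratio, rng):
    """Zero out features at random, keeping roughly `keep_ratio` of them."""
    keep = rng.random(x.shape) < keep_ratio
    return x * keep

def fuse(xa, xb, W):
    """Toy fusion (assumption): concatenate, linear map, tanh, L2-normalize."""
    z = np.tanh(np.concatenate([xa, xb], axis=1) @ W)
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)

def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE: row i of `anchor` should match row i of `positive`."""
    logits = anchor @ positive.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def masked_alignment_loss(xa, xb, W, num_masks=4, keep_ratio=0.2, seed=0):
    """Average InfoNCE between the unmasked fusion and several masked fusions."""
    rng = np.random.default_rng(seed)
    z_full = fuse(xa, xb, W)  # unmasked fused representation
    losses = []
    for _ in range(num_masks):
        # Occlude most features of each modality independently, then fuse.
        z_masked = fuse(random_mask(xa, keep_ratio, rng),
                        random_mask(xb, keep_ratio, rng), W)
        losses.append(info_nce(z_full, z_masked))
    return float(np.mean(losses))

# Synthetic inputs: 8 samples, two modalities with 16 and 12 features.
rng = np.random.default_rng(42)
xa = rng.normal(size=(8, 16))
xb = rng.normal(size=(8, 12))
W = rng.normal(size=(28, 10)) * 0.1
loss = masked_alignment_loss(xa, xb, W)
print(f"masked alignment loss: {loss:.4f}")
```

Minimizing this loss pulls each unmasked fused representation toward its own masked variants and away from those of other samples in the batch. Raising `num_masks` exposes the model to more partial modality combinations per step at a proportional compute cost, which is the trade-off that motivates a closed-form approximation in the original method.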
