Multi-modal methods have demonstrated comprehensive superiority over uni-modal methods. However, imbalanced contributions of different modalities to task-dependent predictions consistently degrade the discriminative performance of canonical multi-modal methods. Based on their contribution to task-dependent predictions, modalities can be categorized as predominant or auxiliary. Benchmark methods propose a tractable solution: augmenting the auxiliary modality, which contributes less, during training. However, our empirical explorations challenge the fundamental idea behind this strategy, and we further conclude that benchmark approaches suffer from two defects: insufficient theoretical interpretability and limited capability to explore discriminative knowledge. To this end, we revisit multi-modal representation learning from a causal perspective and build a Structural Causal Model. Guided by the empirical explorations, we aim to capture the true causality between the discriminative knowledge of the predominant modality and the predictive label while taking the auxiliary modality into account. To do so, we introduce the $\beta$-generalization front-door criterion. Furthermore, we propose a novel network for sufficiently exploring multi-modal discriminative knowledge. Rigorous theoretical analyses and various empirical evaluations support the effectiveness of the mechanism behind our proposed method.