The inherent challenge of multimodal fusion is to capture cross-modal correlations precisely and to conduct cross-modal interaction flexibly. To fully exploit the value of each modality and mitigate the influence of low-quality multimodal data, dynamic multimodal fusion has emerged as a promising learning paradigm. Despite its widespread use, theoretical justification in this field is still notably lacking. Can we design a provably robust multimodal fusion method? This paper provides a theoretical understanding that answers this question from the generalization perspective, under one of the most popular multimodal fusion frameworks. We then show that several uncertainty estimation solutions are naturally available for achieving robust multimodal fusion. Building on this, we propose a novel multimodal fusion framework termed Quality-aware Multimodal Fusion (QMF), which improves both classification accuracy and model robustness. Extensive experimental results on multiple benchmarks support our findings.