Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly in datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition and often neglect critical real-world challenges such as input corruptions, missing modalities, and model trustworthiness. This lack of standardization prevents a reliable assessment of the field's progress. To address this issue, we introduce MMDG-Bench, the first unified and comprehensive benchmark for MMDG, which standardizes evaluation across six datasets spanning three diverse tasks: action recognition, mechanical fault diagnosis, and sentiment analysis. MMDG-Bench encompasses six modality combinations, nine representative methods, and multiple evaluation settings. Beyond standard accuracy, it systematically assesses corruption robustness, missing-modality generalization, misclassification detection, and out-of-distribution detection. With 7,402 neural networks trained in total across 95 unique cross-domain tasks, MMDG-Bench yields five key findings: (1) under fair comparisons, recent specialized MMDG methods offer only marginal improvements over the empirical risk minimization (ERM) baseline; (2) no single method consistently outperforms the others across datasets or modality combinations; (3) a substantial gap to upper-bound performance persists, indicating that MMDG remains far from solved; (4) trimodal fusion does not consistently outperform the strongest bimodal configurations; and (5) all evaluated methods degrade significantly under corruption and missing-modality scenarios, and some methods further compromise model trustworthiness.
