Deep learning architectures for multimodal learning have grown increasingly complex, driven by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study: we reimplement 19 high-impact methods under standardized conditions, evaluate them across nine diverse datasets with up to 23 modalities, and test their generalizability to new tasks beyond their original scope, including settings with missing modalities. We propose a Simple Baseline for Multimodal Learning (SimBaMM), a straightforward late-fusion Transformer architecture, and demonstrate that, under standardized experimental conditions with rigorous hyperparameter tuning of all methods, more complex architectures do not reliably outperform it. Statistical analysis indicates that these methods perform on par with SimBaMM and frequently fail to outperform well-tuned unimodal baselines, especially in the small-data regime considered in many of the original studies. To support our findings, we present a case study of a recent multimodal learning method that highlights methodological shortcomings in the literature. In addition, we provide a pragmatic reliability checklist to promote comparable, robust, and trustworthy future evaluations. In summary, we argue for a shift in focus: away from the pursuit of architectural novelty and toward methodological rigor.
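To make the late-fusion idea concrete, the following is a minimal numpy sketch of the pattern the abstract describes: each modality is encoded independently into one token, the per-modality tokens are fused by a single self-attention layer, and a missing modality is handled by simply dropping its token. All layer sizes, the random projections standing in for trained encoders, and the function name `late_fusion_forward` are illustrative assumptions, not the actual SimBaMM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def late_fusion_forward(modalities, d_model=16):
    """Encode each modality independently, then fuse the resulting
    per-modality tokens with one self-attention layer (late fusion).
    Weights are random placeholders for trained parameters."""
    # Per-modality encoders: here, one linear projection per modality.
    tokens = []
    for x in modalities:  # each x is a (d_in,) feature vector
        W = rng.standard_normal((x.shape[0], d_model)) / np.sqrt(x.shape[0])
        tokens.append(x @ W)  # one token per modality
    T = np.stack(tokens)  # (num_modalities, d_model)
    # Single-head self-attention over the modality tokens.
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(3))
    Q, K, V = T @ Wq, T @ Wk, T @ Wv
    A = softmax(Q @ K.T / np.sqrt(d_model))  # (m, m) attention weights
    fused = (A @ V).mean(axis=0)             # pool fused tokens -> (d_model,)
    return fused

# Missing modalities need no architectural change: drop the token.
img = rng.standard_normal(32)  # e.g. image features
txt = rng.standard_normal(24)  # e.g. text features
print(late_fusion_forward([img, txt]).shape)       # both modalities
print(late_fusion_forward([img]).shape)            # text missing
```

Both calls produce a fused vector of the same dimensionality, which is what lets a late-fusion design of this kind generalize to missing-modality settings without retraining a separate model per modality subset.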