Deep learning architectures for multimodal learning have increased in complexity, driven by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study reimplementing 19 high-impact methods under standardized conditions. We evaluate them across nine diverse datasets with up to 23 modalities, and test their generalizability to new tasks beyond their original scope, including settings with missing modalities. We propose a Simple Baseline for Multimodal Learning (SimBaMM), a late-fusion Transformer architecture, and demonstrate that under standardized experimental conditions with rigorous hyperparameter tuning of all methods, more complex architectures do not reliably outperform SimBaMM. Statistical analyses show that complex methods perform on par with SimBaMM and often fail to consistently outperform well-tuned unimodal baselines, especially in small-data settings. To support our findings, we include a case study highlighting common methodological shortcomings in the literature, followed by a pragmatic reliability checklist to promote comparable, robust, and trustworthy future evaluations. In summary, we argue for a shift in focus: away from the pursuit of architectural novelty and toward methodological rigor.