Evaluating the generalisation capabilities of multimodal models solely by their performance on out-of-distribution data fails to capture their true robustness. This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models, considering architectural design, input perturbations across the language and vision modalities, and increased task complexity. The framework reveals that multimodal models are resilient to extreme instruction perturbations yet vulnerable to observational changes, raising concerns about overfitting to spurious correlations. Applying this framework to current Transformer-based multimodal models for robotic manipulation tasks, we uncover their limitations and suggest that future advances focus on architectural and training innovations that better integrate multimodal inputs, strengthening generalisation by prioritising sensitivity to input content over incidental correlations.