On the generalization capacity of neural networks during generic multimodal reasoning

The advent of the Transformer has led to the development of large language models (LLMs), which appear to demonstrate human-like capabilities. To assess how well this class of models, and a variety of other base neural network architectures, generalizes to multimodal domains, we evaluated and compared their capacity for multimodal generalization. We introduce a multimodal question-answer benchmark to evaluate three specific types of out-of-distribution (OOD) generalization performance: distractor generalization (generalization in the presence of distractors), systematic compositional generalization (generalization to new task permutations), and productive compositional generalization (generalization to more complex task structures). We found that across model architectures (e.g., RNNs, Transformers, and Perceivers), models with multiple attention layers, or models that leveraged cross-attention mechanisms between input domains, fared better. Our positive results demonstrate that for multimodal distractor and systematic generalization, cross-modal attention or deeper attention layers are key architectural features required to integrate multimodal inputs. On the other hand, neither of these architectural features led to productive generalization, suggesting fundamental limitations of existing architectures for specific types of multimodal generalization. These results demonstrate the strengths and limitations of specific architectural components underlying modern neural models for multimodal reasoning. Finally, we provide Generic COG (gCOG), a configurable benchmark with several multimodal generalization splits, for future studies to explore.
