Compositional generalization, the ability of intelligent models to extrapolate their understanding of individual components to novel compositions, is a fundamental yet challenging problem in AI research, especially in multimodal environments. In this work, we address this challenge by exploiting the syntactic structure of language. We highlight the importance of syntactic grounding, particularly through attention masking techniques derived from parsing the text input, and we introduce and evaluate the use of syntactic information in the multimodal grounding problem. Our results on grounded compositional generalization show that dependency parsing has a positive impact across diverse tasks when combined with weight sharing across the Transformer encoder. These results advance the state of the art in multimodal grounding and parameter-efficient modeling and provide insights for future research.
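To make the idea of a parse-derived attention mask concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes a dependency parse is already available as a list of head indices, and it assumes one simple masking rule: each token may attend to itself, to its head, and to its direct dependents. The function name `dependency_attention_mask` and the head-index encoding (with `-1` for the root) are hypothetical choices made for illustration only.

```python
from typing import List


def dependency_attention_mask(heads: List[int]) -> List[List[bool]]:
    """Build a boolean self-attention mask from dependency heads.

    heads[i] is the index of token i's syntactic head, or -1 for the root.
    mask[i][j] is True when token i is allowed to attend to token j.
    Assumed rule (illustrative only): self, head, and direct dependents.
    """
    n = len(heads)
    # Every token may always attend to itself.
    mask = [[i == j for j in range(n)] for i in range(n)]
    for child, head in enumerate(heads):
        if head < 0:
            continue  # the root token has no head
        if head >= n:
            raise ValueError(f"head index {head} out of range for {n} tokens")
        # Allow attention in both directions along the dependency arc.
        mask[child][head] = True
        mask[head][child] = True
    return mask


# Example: "the red square", where "the" and "red" both depend on "square".
heads = [2, 2, -1]
mask = dependency_attention_mask(heads)
for row in mask:
    print(row)
```

In a Transformer encoder, a mask of this kind would typically be converted to the framework's expected format before being passed to the attention layer. Masking conventions differ between libraries (in some, `True` means "blocked" rather than "allowed"), so the polarity should be checked against the API in use.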