Grounding has been argued to be a crucial component in the development of more complete and truly semantically competent artificial intelligence systems. The literature has divided into two camps: while some argue that grounding allows for qualitatively different generalizations, others believe that its absence can be compensated for by the quantity of mono-modal data. Limited empirical evidence has emerged for or against either position, which we argue is due to the methodological challenges that come with studying grounding and its effects on NLP systems. In this paper, we establish a methodological framework for studying the effects, if any, of providing models with richer input sources than text alone. The crux of the framework lies in constructing comparable samples of populations of models trained on different input modalities, so that we can tease apart the qualitative effects of different input sources from quantifiable model performance. Experiments using this framework reveal qualitative differences in model behavior between cross-modally grounded, cross-lingually grounded, and ungrounded models, which we measure both at the global dataset level and for specific word representations, depending on how concrete their semantics is.