Human speakers can generate descriptions of perceptual concepts that abstract away from the instance level. Moreover, other speakers can use such descriptions to learn provisional representations of those concepts. Learning and using abstract perceptual concepts is under-investigated in the language-and-vision field. The problem is also highly relevant to representation learning in multi-modal NLP. In this paper, we introduce a framework for testing category-level perceptual grounding in multi-modal language models. In particular, we train separate neural networks to generate and interpret descriptions of visual categories. We measure the communicative success of the two models by the zero-shot classification performance of the interpretation model, which we argue is an indicator of perceptual grounding. Using this framework, we compare the performance of prototype- and exemplar-based representations. Finally, we show that communicative success exposes performance issues in the generation model that traditional intrinsic NLG evaluation metrics do not capture, and we argue that these issues stem from a failure to properly ground language in vision at the category level.