Human language users can generate descriptions of perceptual concepts that go beyond instance-level representations, and can also use such descriptions to learn provisional class-level representations. However, the ability of computational models to learn and operate with class representations is under-investigated in the language-and-vision field. In this paper, we train separate neural networks to generate and interpret class-level descriptions. We then use the zero-shot classification performance of the interpretation model as a measure of communicative success and class-level conceptual grounding. We investigate the performance of prototype- and exemplar-based neural representations for grounded category description. Finally, we show that communicative success reveals performance issues in the generation model that are not captured by traditional intrinsic NLG evaluation metrics, and argue that these issues can be traced to a failure to properly ground language in vision at the class level. We also observe that the interpretation model performs better with descriptions that are low in diversity at the class level, possibly indicating a strong reliance on frequently occurring features.