Human language users can generate descriptions of perceptual concepts beyond instance-level representations and also use such descriptions to learn provisional class-level representations. However, the ability of computational models to learn and operate with class representations is under-investigated in the language-and-vision field. In this paper, we train separate neural networks to generate and interpret class-level descriptions. We then use the zero-shot classification performance of the interpretation model as a measure of communicative success and class-level conceptual grounding. We investigate the performance of prototype- and exemplar-based neural representations for grounded category description. Finally, we show that communicative success reveals performance issues in the generation model that are not captured by traditional intrinsic NLG evaluation metrics, and argue that these issues can be traced to a failure to properly ground language in vision at the class level. We observe that the interpretation model performs better with descriptions that are low in diversity at the class level, possibly indicating a strong reliance on frequently occurring features.