Scene context is well known to facilitate human perception of visible objects. In this paper, we investigate the role of context in Referring Expression Generation (REG) for objects in images, where existing research has often focused on distractor contexts that exert pressure on the generator. We take a new perspective on scene context in REG and hypothesize that contextual information can be conceived of as a resource that makes REG models more resilient and facilitates the generation of object descriptions, and of object types in particular. We train and test Transformer-based REG models with target representations that have been artificially obscured with noise to varying degrees. We evaluate how properties of the models' visual context affect their processing and performance. Our results show that even simple scene contexts make models surprisingly resilient to perturbations, to the extent that they can identify referent types even when visual information about the target is completely missing.
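To make the idea of obscuring the target "to varying degrees" concrete, the following is a minimal sketch, not the procedure used in this work. It assumes a grayscale image stored as nested lists, a rectangular target box, and uniform random noise applied to a chosen fraction of the target pixels; the function name `obscure_target` and its parameters are hypothetical and serve only as illustration.

```python
import random


def obscure_target(image, box, noise_level, seed=0):
    """Return a copy of `image` with part of the target region replaced by noise.

    Assumptions (illustrative only, not taken from the paper):
      image:       list of rows, each a list of grayscale ints in 0..255
      box:         (top, left, bottom, right), bottom and right exclusive
      noise_level: fraction of target pixels to overwrite, in [0, 1]
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be in [0, 1]")

    rng = random.Random(seed)
    top, left, bottom, right = box

    # Copy row by row so the input image is left untouched.
    noisy = [row[:] for row in image]

    # Pick a random subset of target pixels and overwrite them with noise.
    coords = [(r, c) for r in range(top, bottom) for c in range(left, right)]
    n_noisy = round(noise_level * len(coords))
    for r, c in rng.sample(coords, n_noisy):
        noisy[r][c] = rng.randrange(256)

    # Pixels outside the box (the scene context) are never modified.
    return noisy


# Usage: a 4x6 image with a 2x3 target region, obscured at three levels.
image = [[100] * 6 for _ in range(4)]
box = (1, 2, 3, 5)
variants = {level: obscure_target(image, box, level) for level in (0.0, 0.5, 1.0)}
```

At `noise_level=1.0` every pixel inside the box is overwritten, which corresponds to the setting where visual information about the target is completely missing and only the surrounding context remains informative.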