Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even though it is generally expected to take place when the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, and introduce three benchmarks to study the relation between the two. Our results show that contemporary models are inconsistent in their ability to ground phrases and to solve tasks. We show how this can be addressed through brute-force training on phrase grounding annotations, and analyze the dynamics it creates. Code is available at https://github.com/lil-lab/phrase_grounding.