Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even though it is generally expected to take place when the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, and propose three benchmarks to study the relation between the two. Our results show that contemporary models are inconsistent in their ability to ground phrases and to solve tasks. We show how this can be addressed through brute-force training on phrase grounding annotations, and analyze the dynamics it creates. Code and data are available at https://github.com/lil-lab/phrase_grounding.