Coreference resolution aims to identify words and phrases that refer to the same entity in a text, and is a core task in natural language processing. In this paper, we extend this task to resolving coreferences in long-form narrations of visual scenes. First, we introduce a new dataset with annotated coreference chains and their bounding boxes, as most existing image-text datasets contain only short sentences without coreferring expressions or labeled chains. We then propose a new technique that learns to identify coreference chains using weak supervision, relying only on image-text pairs and a regularization based on prior linguistic knowledge. Our model yields large performance gains over several strong baselines in resolving coreferences. We also show that coreference resolution helps improve the grounding of narratives in images.