In this paper, we study multimodal coreference resolution, specifically the setting where a longer descriptive text, i.e., a narration, is paired with an image. This setting poses significant challenges due to the need for fine-grained image-text alignment, the inherent ambiguity of narrative language, and the lack of large annotated training sets. To tackle these challenges, we present a data-efficient semi-supervised approach that uses image-narration pairs to resolve coreferences and perform narrative grounding in a multimodal context. Our approach incorporates losses for both labeled and unlabeled data within a cross-modal framework. Our evaluation shows that the proposed approach outperforms strong baselines, both quantitatively and qualitatively, on the tasks of coreference resolution and narrative grounding.