This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models. GITM-MR requires a model to first determine whether an expression describes an image, and then either localize the referred objects when it does or ground the mismatched parts of the text when it does not. We provide a benchmark for evaluating pre-trained models on this task, focusing on two challenging settings: limited training data and out-of-distribution sentence lengths. Our evaluation shows that pre-trained models lack both data efficiency and length generalization ability. To address these weaknesses, we propose the Relation-sensitive Correspondence Reasoning Network (RCRN), which performs relation-aware reasoning via bi-directional message propagation guided by language structure. RCRN can be interpreted as a modular program and delivers strong performance in both length generalization and data efficiency.