This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models. GITM-MR requires a model to first determine if an expression describes an image, then localize referred objects or ground the mismatched parts of the text. We provide a benchmark for evaluating pre-trained models on this task, with a focus on the challenging settings of limited data and out-of-distribution sentence lengths. Our evaluation demonstrates that pre-trained models lack data efficiency and length generalization ability. To address this, we propose the Relation-sensitive Correspondence Reasoning Network (RCRN), which incorporates relation-aware reasoning via bi-directional message propagation guided by language structure. RCRN can be interpreted as a modular program and delivers strong performance in both length generalization and data efficiency.
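As a rough illustration of the two-stage output the task asks for (first a match decision, then either object localization or grounding of the mismatched text), the interface might be sketched as follows. This is a minimal sketch and not the RCRN model: the names `RelationScore`, `GITMMROutput`, and `decide`, the box format, and the 0.5 threshold are all hypothetical, and the per-relation scores are assumed to come from some upstream model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class RelationScore:
    """Assumed upstream output: how well one relation phrase fits the image."""
    phrase: str   # relation phrase from the expression, e.g. "to the left of"
    score: float  # assumed confidence in [0, 1] that the relation holds


@dataclass
class GITMMROutput:
    """Hypothetical container for the task's two possible outcomes."""
    matched: bool
    boxes: Optional[Dict[str, Box]]          # referred objects, if matched
    mismatched_phrases: Optional[List[str]]  # mismatched text, if not matched


def decide(
    relations: List[RelationScore],
    candidate_boxes: Dict[str, Box],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> GITMMROutput:
    """Return a match decision plus the corresponding grounding.

    The expression is treated as matching only if every relation phrase
    scores at or above the threshold; otherwise the low-scoring phrases
    are reported as the mismatched parts of the text.
    """
    mismatched = [r.phrase for r in relations if r.score < threshold]
    if mismatched:
        return GITMMROutput(matched=False, boxes=None,
                            mismatched_phrases=mismatched)
    return GITMMROutput(matched=True, boxes=dict(candidate_boxes),
                        mismatched_phrases=None)
```

Under these assumptions, an expression with one low-scoring relation is returned as a mismatch together with the offending phrase, while an expression whose relations all pass is returned with the boxes of its referred objects.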