Multimodal referring expression comprehension (REC), which aims to localize the image region described by a natural language expression, has recently received increasing attention within the research community. In this paper, we focus on referring expression comprehension with commonsense knowledge (KB-Ref), a task that typically requires reasoning beyond spatial, visual, or semantic information. We propose a novel framework, Commonsense Knowledge Enhanced Transformers (CK-Transformer), which integrates commonsense knowledge into the representations of objects in an image, facilitating identification of the target objects referred to by the expressions. We conduct extensive experiments on several benchmarks for the KB-Ref task. Our results show that the proposed CK-Transformer achieves a new state of the art, with an absolute improvement of 3.14% in accuracy over the existing state of the art.