Multimodal relation extraction (MRE) is the task of identifying the semantic relationship between two entities based on the context of a sentence-image pair. Existing retrieval-augmented approaches mainly focus on modeling the retrieved textual knowledge, which alone may not suffice to identify complex relations accurately. To improve prediction, this research proposes to retrieve textual and visual evidence based on the object, the sentence, and the whole image. We further develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning within and across modalities. Extensive experiments and analyses show that the proposed method effectively selects and compares evidence across modalities and significantly outperforms state-of-the-art models.