Multi-Modal Relation Extraction (MMRE) aims to identify the relation between two entities in text that contains visual clues. Rich visual content is valuable for MMRE, but existing work does not model the fine-grained associations among modalities well, so it fails to capture the visual information that is truly helpful, which limits relation extraction performance. In this paper, we propose DGF-PT, a novel MMRE framework that better captures the deeper correlations among text, entity pair, and image/objects in order to mine more helpful information for the task. We first propose a prompt-based autoregressive encoder, which builds task-related associations of intra-modal and inter-modal features through entity-oriented and object-oriented prefixes, respectively. To better integrate helpful visual information, we design a dual-gated fusion module that distinguishes the importance of the image and objects and further enriches the text representations. In addition, we introduce a generative decoder with entity-type restrictions on relations, which better filters out candidates. Extensive experiments on the benchmark dataset show that our approach achieves excellent performance compared with strong competitors, even in the few-shot setting.
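To make the idea of gating visual information concrete, the following is a minimal sketch of one possible two-gate fusion, not the DGF-PT module itself. The function name `dual_gated_fusion`, the weight vectors `w_image` and `w_object`, and the additive form of the fusion are all assumptions chosen for illustration; the abstract only states that the module weighs the image and objects and enriches the text representation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def dual_gated_fusion(text_vec, image_vec, object_vecs, w_image, w_object):
    """Hypothetical fusion: one gate for the whole image, one gate per object.

    text_vec, image_vec and every entry of object_vecs have dimension d;
    w_image and w_object have dimension 2 * d because each gate scores the
    concatenation [text; visual].
    """
    # Gate 1: how much of the global image feature to admit, given the text.
    g_img = sigmoid(dot(w_image, list(text_vec) + list(image_vec)))
    # Gate 2: one value per detected object, again conditioned on the text.
    g_objs = [sigmoid(dot(w_object, list(text_vec) + list(o))) for o in object_vecs]

    # Enrich the text representation with the gated visual features.
    fused = list(text_vec)
    for i in range(len(fused)):
        fused[i] += g_img * image_vec[i]
        fused[i] += sum(g * o[i] for g, o in zip(g_objs, object_vecs))
    return fused, g_img, g_objs
```

In a trained model the gate weights would be learned, so that uninformative images or objects receive gate values near zero and contribute little to the fused representation.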
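The entity-type restriction on relations can be illustrated with a small sketch, under the assumption that each relation permits only certain (head type, tail type) pairs. The schema, the relation names, and the helper functions below are hypothetical and are not taken from the paper or its dataset; they only show how candidates incompatible with the entity types can be filtered out before the highest-scoring relation is chosen.

```python
# Hypothetical schema: relation -> set of (head type, tail type) pairs it permits.
# None marks a relation that is allowed for every entity pair.
RELATION_SCHEMA = {
    "member_of": {("PER", "ORG")},
    "lives_in": {("PER", "LOC")},
    "located_in": {("ORG", "LOC")},
    "none": None,
}


def allowed_relations(head_type, tail_type, schema=RELATION_SCHEMA):
    """Return the relations whose type signature matches the entity pair."""
    return [
        rel
        for rel, pairs in schema.items()
        if pairs is None or (head_type, tail_type) in pairs
    ]


def restricted_argmax(scores, head_type, tail_type, schema=RELATION_SCHEMA):
    """Pick the best-scoring relation among the type-compatible candidates.

    Relations missing from `scores` are treated as having score -inf.
    """
    candidates = allowed_relations(head_type, tail_type, schema)
    return max(candidates, key=lambda rel: scores.get(rel, float("-inf")))


# Usage: "located_in" has the highest raw score, but it is not valid for a
# (PER, ORG) pair, so the restricted decoder never considers it.
scores = {"located_in": 5.0, "member_of": 1.0, "lives_in": 0.5, "none": 0.2}
prediction = restricted_argmax(scores, "PER", "ORG")
```

A generative decoder would apply the same kind of mask to its output vocabulary at decoding time rather than to a dictionary of scores.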