Multimodal Relation Extraction is crucial for constructing flexible and realistic knowledge graphs. Recent studies focus on extracting relation types for entity pairs that appear in different modalities, e.g., one entity in the text and the other in the image. However, existing approaches require the entities and objects to be given beforehand, which is costly and impractical. To address this limitation, we propose a novel task, Multimodal Entity-Object Relational Triple Extraction, which aims to extract all triples (entity span, relation, object region) from image-text pairs. To facilitate this study, we modified the multimodal relation extraction dataset MORE, which covers 21 relation types, to create a new dataset containing 20,264 triples, an average of 5.75 triples per image-text pair. Moreover, we propose QEOT, a query-based model with a selective attention mechanism that dynamically explores the interaction and fusion of textual and visual information. In particular, the proposed method accomplishes entity extraction, relation classification, and object detection simultaneously with a single set of queries. Our method is well suited to downstream applications and avoids the error accumulation inherent in pipeline-style approaches. Extensive experimental results demonstrate that our method outperforms existing baselines by 8.06% and achieves state-of-the-art performance.
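To make the query-based design concrete, the following is a minimal sketch, not the authors' implementation: a shared set of learnable queries attends over text and image features, a selective gate softly weights the two modality contexts per query, and three lightweight heads decode entity spans, relation types, and object boxes from the same queries. All module names, dimensions, and the exact gating form are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryTripleDecoder(nn.Module):
    """Hypothetical sketch of a query-based triple decoder: one shared
    query set jointly drives entity, relation, and object predictions
    (all dimensions and head shapes are illustrative only)."""

    def __init__(self, d_model=256, num_queries=20, num_relations=21):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        # Cross-attention from queries to modality features (shared weights
        # across modalities here, purely for brevity).
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8,
                                                batch_first=True)
        # Selective gate: per-query soft weighting of text vs. image context.
        self.gate = nn.Sequential(nn.Linear(d_model, 2), nn.Softmax(dim=-1))
        # Task heads decoded from the same fused queries.
        self.span_head = nn.Linear(d_model, 2)             # toy start/end logits
        self.rel_head = nn.Linear(d_model, num_relations)  # relation classification
        self.box_head = nn.Linear(d_model, 4)              # object region (cx, cy, w, h)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, Lt, D) from a text encoder;
        # image_feats: (B, Lv, D) from a visual backbone.
        B = text_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Attend to each modality separately, then selectively mix.
        t_ctx, _ = self.cross_attn(q, text_feats, text_feats)
        v_ctx, _ = self.cross_attn(q, image_feats, image_feats)
        w = self.gate(q)  # (B, Q, 2): per-query text/image weights
        fused = w[..., :1] * t_ctx + w[..., 1:] * v_ctx
        return (self.span_head(fused), self.rel_head(fused),
                self.box_head(fused).sigmoid())

# Toy usage with random features standing in for real encoders.
model = QueryTripleDecoder()
spans, rels, boxes = model(torch.randn(2, 32, 256), torch.randn(2, 49, 256))
print(spans.shape, rels.shape, boxes.shape)  # (2, 20, 2) (2, 20, 21) (2, 20, 4)
```

Decoding all three predictions from one query set is what avoids pipeline-style error accumulation under this sketch: each query is trained to represent a complete (entity span, relation, object region) triple rather than consuming the output of an upstream stage.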