Extracting relational facts from multimodal data is a crucial task in multimedia and knowledge graph research, with widespread real-world applications. Recent studies focus on recognizing relational facts in which both entities appear in a single modality, with other modalities providing supplementary information. However, such work overlooks the substantial number of multimodal relational facts that span modalities, for example where one entity appears in the text and the other in an image. In this paper, we propose a new task, Multimodal Object-Entity Relation Extraction, which aims to extract "object-entity" relational facts from image and text data. To facilitate research on this task, we introduce MORE, a new dataset comprising 21 relation types and 20,264 multimodal relational facts annotated on 3,559 pairs of textual news titles and corresponding images. To show the challenges of Multimodal Object-Entity Relation Extraction, we evaluate recent state-of-the-art methods for multimodal relation extraction and conduct a comprehensive experimental analysis on MORE. Our results demonstrate significant challenges for existing methods, underlining the need for further research on this task. Based on our experiments, we identify several promising directions for future research. The MORE dataset and code are available at https://github.com/NJUNLP/MORE.