How can we better extract entities and relations from text? Multimodal extraction, which uses images alongside text, obtains additional signals for entities and relations and aligns them through graph-based or hierarchical fusion to aid extraction. Despite the variety of fusion approaches attempted, previous works have overlooked the many unlabeled image-caption pairs available, such as NewsCLIPing. This paper proposes novel pre-training objectives for entity-object and relation-image alignment: objects are extracted from images and aligned with entity and relation prompts to produce soft pseudo-labels. These labels serve as self-supervised signals for pre-training, enhancing the model's ability to extract entities and relations. Experiments on three datasets show an average 3.41% F1 improvement over the prior SOTA. Additionally, our method is orthogonal to previous multimodal fusions, and applying it to prior SOTA fusions yields a further 5.47% F1 improvement.
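To make the idea of soft pseudo-labels concrete, the following is a minimal sketch of one way such labels could be produced and used. It is not the paper's implementation: the abstract does not say how objects and prompts are scored against each other, so the cosine similarity, the temperature-scaled softmax, the soft cross-entropy loss, and every name here (`soft_pseudo_labels`, `soft_cross_entropy`, `temperature`, the random embeddings) are assumptions chosen for illustration.

```python
import numpy as np


def soft_pseudo_labels(object_emb, prompt_emb, temperature=0.1):
    """Return one probability distribution over prompts per object.

    Assumption: alignment is scored by cosine similarity and turned into
    soft labels with a temperature-scaled softmax.
    """
    obj = object_emb / np.linalg.norm(object_emb, axis=1, keepdims=True)
    prm = prompt_emb / np.linalg.norm(prompt_emb, axis=1, keepdims=True)
    logits = obj @ prm.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def soft_cross_entropy(pred_logits, soft_targets):
    """Cross-entropy between model logits and soft target distributions."""
    shifted = pred_logits - pred_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(soft_targets * log_probs).sum(axis=1).mean())


# Hypothetical stand-ins for real encoder outputs:
# 3 detected objects, 4 entity prompts, embedding size 8.
rng = np.random.default_rng(0)
object_emb = rng.normal(size=(3, 8))
prompt_emb = rng.normal(size=(4, 8))

labels = soft_pseudo_labels(object_emb, prompt_emb)
loss = soft_cross_entropy(rng.normal(size=(3, 4)), labels)
```

Each row of `labels` is a distribution over prompts rather than a single hard class, which is what lets unlabeled image-caption pairs act as a self-supervised training signal under these assumptions.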