Visual relation extraction (VRE) aims to extract relations between entities from visually rich documents. Existing methods usually predict the relation for each entity pair independently from entity features, ignoring global structure information, i.e., the dependencies between entity pairs. Without this information, a model may struggle to learn long-range relations and can easily produce conflicting predictions. To alleviate these limitations, we propose a GlObal Structure knowledge-guided relation Extraction (GOSE) framework, which captures dependencies between entity pairs in an iterative manner. Given a scanned image of a document, GOSE first generates preliminary relation predictions for entity pairs. It then mines global structure knowledge from the predictions of the previous iteration and incorporates that knowledge into the entity representations. This "generate-capture-incorporate" schema is performed multiple times, so that entity representations and global structure knowledge mutually reinforce each other. Extensive experiments show that GOSE not only outperforms previous methods in the standard fine-tuning setting but also performs strongly in cross-lingual learning, and is more data-efficient in the low-resource setting.
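The "generate-capture-incorporate" loop described in the abstract can be illustrated with a toy sketch. This is not the GOSE implementation: the abstract does not specify the scorer or the mining module, so the bilinear pair scorer, the soft-adjacency neighbour pooling used as a stand-in for global structure mining, and all names (`predict_relations`, `mine_structure`, `iterative_extraction`, `w_pair`, `mix`) are assumptions made purely for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def predict_relations(entities, w_pair):
    """Generate: score every ordered entity pair with a bilinear form (assumed scorer)."""
    logits = entities @ w_pair @ entities.T
    return sigmoid(logits)  # (n, n) pairwise relation probabilities


def mine_structure(entities, probs):
    """Capture: treat the previous predictions as a soft graph and pool neighbours.

    A simple stand-in for global structure mining, not the paper's module.
    """
    adj = probs + probs.T            # symmetrise the predicted links
    np.fill_diagonal(adj, 0.0)       # ignore self-pairs
    adj = adj / (adj.sum(axis=1, keepdims=True) + 1e-8)  # row-normalise
    return adj @ entities            # (n, d) structure-aware context per entity


def iterative_extraction(entities, w_pair, num_iters=3, mix=0.5):
    """Repeat generate -> capture -> incorporate for a fixed number of rounds."""
    history = []
    for _ in range(num_iters):
        probs = predict_relations(entities, w_pair)          # generate
        history.append(probs)
        context = mine_structure(entities, probs)            # capture
        entities = (1.0 - mix) * entities + mix * context    # incorporate
    # Final predictions use the last refined entity representations.
    return predict_relations(entities, w_pair), history


# Toy input: 5 entities with 8-dimensional features (random placeholders).
rng = np.random.default_rng(0)
entities = rng.normal(size=(5, 8))
w_pair = rng.normal(size=(8, 8)) * 0.1
final_probs, history = iterative_extraction(entities, w_pair)
print(final_probs.shape, len(history))
```

The point of the sketch is only the control flow: each round's predictions define the structure that refines the entity representations used in the next round, which is how predictions for different entity pairs come to depend on one another.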