Many datasets have been developed to train and evaluate document-level relation extraction (RE) models, and most of them are constructed from real-world data. It has been shown that RE models trained on real-world data suffer from factual biases. To evaluate and address this issue, we present CovEReD, a counterfactual data generation approach for document-level relation extraction datasets based on entity replacement. We first demonstrate that models trained on factual data exhibit inconsistent behavior: while they accurately extract triples from factual documents, they fail to extract the same triples after counterfactual modification. This inconsistency suggests that models trained on factual data rely on spurious signals, such as specific entities and external knowledge, rather than on the input context, to extract triples. We show that by generating document-level counterfactual data with CovEReD and training models on it, consistency is maintained with minimal impact on RE performance. We release our CovEReD pipeline as well as Re-DocRED-CF, a dataset of counterfactual RE documents, to assist in evaluating and addressing inconsistency in document-level RE.
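To make the entity-replacement idea concrete, the following is a minimal sketch of counterfactual document generation, assuming a simplified document format. The `Doc`/`Mention` dataclasses and the `replace_entity` helper are illustrative assumptions, not the actual CovEReD pipeline.

```python
# Minimal sketch: rewrite a document by substituting one entity everywhere,
# keeping mention offsets and gold triples consistent with the new text.
# (Illustrative format only; not the CovEReD API.)
from dataclasses import dataclass

@dataclass
class Mention:
    start: int       # character offset where the mention begins
    end: int         # character offset where the mention ends (exclusive)
    entity_id: str   # identifier of the entity this mention refers to

@dataclass
class Doc:
    text: str
    mentions: list[Mention]
    triples: list[tuple[str, str, str]]  # (head_id, relation, tail_id)

def replace_entity(doc: Doc, target_id: str, new_name: str, new_id: str) -> Doc:
    """Replace every mention of `target_id` with `new_name`."""
    pieces, new_mentions = [], []
    cursor = shift = 0
    for m in sorted(doc.mentions, key=lambda m: m.start):
        if m.entity_id == target_id:
            pieces.append(doc.text[cursor:m.start])  # text up to the mention
            pieces.append(new_name)                  # substituted surface form
            start = m.start + shift
            new_mentions.append(Mention(start, start + len(new_name), new_id))
            shift += len(new_name) - (m.end - m.start)
            cursor = m.end
        else:
            # Untouched mentions shift by the accumulated length change.
            new_mentions.append(Mention(m.start + shift, m.end + shift, m.entity_id))
    pieces.append(doc.text[cursor:])
    # Gold triples follow the replacement: the resulting facts are
    # counterfactual, yet still fully supported by the modified context.
    new_triples = [(new_id if h == target_id else h, r,
                    new_id if t == target_id else t) for h, r, t in doc.triples]
    return Doc("".join(pieces), new_mentions, new_triples)

# Usage: a consistent RE model should extract the counterfactual triple
# ("Q567", "place_of_birth", "Q18094") from cf.text, even though that fact
# contradicts world knowledge.
doc = Doc(
    text="Barack Obama was born in Honolulu. Obama served as president.",
    mentions=[Mention(0, 12, "Q76"), Mention(25, 33, "Q18094"), Mention(35, 40, "Q76")],
    triples=[("Q76", "place_of_birth", "Q18094")],
)
cf = replace_entity(doc, "Q76", "Angela Merkel", "Q567")
```

A model that relies on the input context will extract the same relation from `doc` and `cf`; one that has memorized the original entity pair will not, which is the inconsistency the abstract describes.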