Document-level relation extraction (DocRE) has attracted growing research interest in recent years. While models achieve consistent performance gains in DocRE, their underlying decision rules remain understudied: do they make the right predictions based on rationales? In this paper, we take a first step toward answering this question and introduce a new perspective for comprehensively evaluating a model. Specifically, we first annotate the rationales that humans rely on in DocRE. We then show that representative state-of-the-art (SOTA) DocRE models follow decision rules that differ from those of humans. Through our proposed RE-specific attacks, we next demonstrate that this significant discrepancy in decision rules between models and humans severely damages the robustness of models and renders them inapplicable to real-world RE scenarios. We then introduce mean average precision (MAP) to evaluate the understanding and reasoning capabilities of models. Based on extensive experimental results, we call on future work to evaluate both the performance and the understanding ability of models when developing their applications. We make our annotations and code publicly available.
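The abstract does not spell out how MAP is computed. As a minimal sketch, assuming the standard information-retrieval definition, one can rank the tokens of a document by a model's attribution score and treat the human-annotated rationale tokens as the relevant items. The function names, the token-index representation, and the toy data below are hypothetical and are not taken from the released code.

```python
from typing import Sequence, Set


def average_precision(ranked: Sequence[int], relevant: Set[int]) -> float:
    """Average precision of one ranking.

    ranked:   token indices sorted from most to least important (assumed input).
    relevant: token indices that humans marked as rationale (assumed input).
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            # Precision at rank k, accumulated only at ranks that hit a rationale token.
            precision_sum += hits / k
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Sequence[Sequence[int]], rationales: Sequence[Set[int]]
) -> float:
    """Mean of the per-instance average precision values."""
    pairs = list(zip(rankings, rationales))
    if not pairs:
        return 0.0
    return sum(average_precision(r, g) for r, g in pairs) / len(pairs)


# Toy data: two relation instances, each with four tokens.
toy_rankings = [[3, 1, 2, 0], [0, 1, 2, 3]]
toy_rationales = [{3, 2}, {3}]
toy_map = mean_average_precision(toy_rankings, toy_rationales)
```

Under this reading, a higher value means the model places more weight on the tokens that humans consider decisive, so the score complements rather than replaces the usual extraction metrics.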