Extracting structured information from unstructured text is critical for many downstream NLP applications and is traditionally achieved by closed information extraction (cIE). However, existing approaches for cIE suffer from two limitations: (i) they are often pipelines, which makes them prone to error propagation, and/or (ii) they are restricted to the sentence level, which prevents them from capturing long-range dependencies and results in expensive inference times. We address these limitations by proposing REXEL, a highly efficient and accurate model for the joint task of document-level cIE (DocIE). REXEL performs mention detection, entity typing, entity disambiguation, coreference resolution and document-level relation classification in a single forward pass to yield facts fully linked to a reference knowledge graph. It is on average 11 times faster than competitive existing approaches in a similar setting, and performs competitively both when optimised for any individual subtask and when optimised for a variety of combinations of joint tasks, surpassing the baselines by an average of more than 6 F1 points. This combination of speed and accuracy makes REXEL an accurate, cost-efficient system for extracting structured information at web scale. We also release an extension of the DocRED dataset to enable benchmarking of future work on DocIE, available at https://github.com/amazon-science/e2e-docie.