End-to-end relation extraction (E2ERE) is an important and realistic application of natural language processing (NLP) in biomedicine. In this paper, we compare three prevailing paradigms for E2ERE using a complex dataset focused on rare diseases involving discontinuous and nested entities. We use the RareDis information extraction dataset to evaluate three competing approaches: named entity recognition (NER) $\rightarrow$ relation extraction (RE) pipelines, joint sequence-to-sequence models, and generative pre-trained transformer (GPT) models. We use comparable state-of-the-art models and best practices for each approach and conduct error analyses to assess their failure modes. Our findings reveal that pipeline models are still the best, while sequence-to-sequence models are not far behind; GPT models with eight times as many parameters perform worse than even sequence-to-sequence models and trail pipeline models by over 10 F1 points. Partial matches and discontinuous entities caused many NER errors, contributing to lower overall end-to-end performance. We also verify these findings on a second E2ERE dataset for chemical-protein interactions. Although generative LM-based methods are more suitable for zero-shot settings, our results show that when training data is available, it is better to work with more conventional models trained and tailored for E2ERE. More innovative methods are needed to marry the best of both worlds from smaller encoder-decoder pipeline models and the larger GPT models to improve E2ERE. As of now, we see that well-designed pipeline models offer substantial performance gains at a lower cost and carbon footprint for E2ERE. This work is also the first to conduct E2ERE on the RareDis dataset.