Scientists learn early on how to cite scientific sources to support their claims. Sometimes, however, scientists struggle to determine where a citation belongs or, even worse, fail to cite a source altogether. Automatically detecting sentences that need a citation (i.e., citation worthiness) could solve both of these issues, leading to more robust and well-constructed scientific arguments. Previous researchers have applied machine learning to this task but have used small datasets and models that do not take advantage of recent algorithmic developments such as attention mechanisms in deep learning. We hypothesize that we can develop highly accurate deep learning architectures that learn from large supervised datasets constructed from open access publications. In this work, we propose a Bidirectional Long Short-Term Memory (BiLSTM) network with an attention mechanism and contextual information to detect sentences that need citations. We also produce a new, large dataset (PMOA-CITE) based on the PubMed Open Access Subset, which is orders of magnitude larger than previous datasets. Our experiments show that our architecture achieves state-of-the-art performance on the standard ACL-ARC dataset ($F_{1}=0.507$) and exhibits high performance ($F_{1}=0.856$) on the new PMOA-CITE. Moreover, we show that learning transfers across these datasets. We further use interpretable models to illuminate how specific language is used to promote and inhibit citations. We discover that sections and surrounding sentences are crucial for our improved predictions. We further examine purported mispredictions of the model and uncover systematic human mistakes in citation behavior and source data. This opens the door for our model to check documents during pre-submission and pre-archival procedures. We make this new dataset, the code, and a web-based tool available to the community.
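For readers unfamiliar with the $F_{1}$ scores reported above, the following is a minimal sketch of how such a score could be computed for a binary citation-worthiness classifier. It assumes the score is the binary $F_{1}$ of the positive class (a sentence that needs a citation); the abstract does not state the averaging scheme, so this is an assumption. The function name `precision_recall_f1` and the toy labels are hypothetical and are not taken from the authors' code.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Binary precision, recall, and F1 for the positive class (label 1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical toy labels: 1 = sentence needs a citation, 0 = it does not.
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because $F_{1}$ is the harmonic mean of precision and recall, it penalizes a model that labels nearly every sentence as citation-worthy (high recall, low precision) as well as one that flags almost none.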