Modeling citation worthiness by using attention-based bidirectional long short-term memory networks and interpretable models

Scientists learn early on how to cite scientific sources to support their claims. Sometimes, however, scientists struggle to determine where a citation should be placed or, even worse, fail to cite a source altogether. Automatically detecting sentences that need a citation (i.e., citation worthiness) could address both of these issues, leading to more robust and well-constructed scientific arguments. Previous researchers have applied machine learning to this task but have used small datasets and models that do not take advantage of recent algorithmic developments such as attention mechanisms in deep learning. We hypothesize that we can develop highly accurate deep learning architectures that learn from large supervised datasets constructed from open access publications. In this work, we propose a Bidirectional Long Short-Term Memory (BiLSTM) network with an attention mechanism and contextual information to detect sentences that need citations. We also produce a new, large dataset (PMOA-CITE) based on the PubMed Open Access Subset, which is orders of magnitude larger than previous datasets. Our experiments show that our architecture achieves state-of-the-art performance on the standard ACL-ARC dataset ($F_{1}=0.507$) and exhibits high performance ($F_{1}=0.856$) on the new PMOA-CITE. Moreover, we show that it can transfer learning across these datasets. We further use interpretable models to illuminate how specific language is used to promote and inhibit citations. We find that sections and surrounding sentences are crucial for our improved predictions. We also examine purported mispredictions of the model and uncover systematic human mistakes in citation behavior and source data. This opens the door for our model to check documents during pre-submission and pre-archival procedures. We make this new dataset, the code, and a web-based tool available to the community.
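To make the phrase "attention mechanism" concrete, here is a minimal sketch of attention pooling over a sequence of per-token hidden states, such as the outputs of a BiLSTM. It is an illustration under stated assumptions, not the architecture from the paper: the function names `softmax` and `attention_pool`, the dot-product scoring, the `query` vector, and the toy numbers are all hypothetical choices made for this example.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    hidden_states: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool per-token hidden states into one sentence vector.

    Assumption: scores are plain dot products with a query vector.
    In a trained model the query would be a learned parameter.
    """
    # One relevance score per token.
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    # Weighted sum of the hidden states, one dimension at a time.
    dim = len(hidden_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context


# Toy hidden states for a three-token sentence (made-up values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]
weights, context = attention_pool(hidden, query)
```

In a classifier, the pooled `context` vector would typically feed a final layer that predicts whether the sentence needs a citation, and the `weights` indicate which tokens contributed most to that vector.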

