LitSumm: Large language models for literature summarisation of non-coding RNAs

Motivation: Curation of literature in the life sciences is a growing challenge. The continued increase in the rate of publication, coupled with the relatively fixed number of curators worldwide, presents a major challenge to developers of biomedical knowledgebases. Very few knowledgebases have the resources to scale to the whole relevant literature, and all have to prioritise their efforts.

Results: In this work, we take a first step towards alleviating the lack of curator time in RNA science by generating summaries of literature for non-coding RNAs (ncRNAs) using large language models (LLMs). We demonstrate that high-quality, factually accurate summaries with accurate references can be automatically generated from the literature using a commercial LLM and a chain of prompts and checks. Manual assessment was carried out for a subset of summaries, with the majority being rated as extremely high quality. We also applied the most commonly used automated evaluation approaches, finding that they do not correlate with human assessment. Finally, we apply our tool to a selection of over 4,600 ncRNAs and make the generated summaries available via the RNAcentral resource. We conclude that automated literature summarisation is feasible with the current generation of LLMs, provided careful prompting and automated checking are applied.

Availability: Code used to produce these summaries can be found here: https://github.com/RNAcentral/litscan-summarization and the dataset of contexts and summaries can be found here: https://huggingface.co/datasets/RNAcentral/litsumm-v1. Summaries are also displayed on the RNA report pages in RNAcentral (https://rnacentral.org/).
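To make the idea of an automated check on references more concrete, the following is a minimal sketch of one such check: confirming that every citation in a generated summary also occurs in the source context it was generated from. It is not taken from the LitSumm code. The citation format (bracketed PMC-style identifiers), the function name `unsupported_citations`, and the example strings are all assumptions made for illustration.

```python
import re

# Assumption: citations are written as PMC-style identifiers, e.g. [PMC1234567].
# The abstract does not specify the citation format.
CITATION = re.compile(r"PMC\d+")


def unsupported_citations(summary: str, context: str) -> set[str]:
    """Return citation IDs used in the summary that never occur in the context."""
    cited = set(CITATION.findall(summary))
    available = set(CITATION.findall(context))
    return cited - available


# Hypothetical context and summaries, for illustration only.
context = (
    "ncRNA-X is upregulated in condition A [PMC1111111]. "
    "ncRNA-X interacts with protein B [PMC2222222]."
)
good_summary = "ncRNA-X interacts with protein B [PMC2222222]."
bad_summary = "ncRNA-X is involved in process C [PMC3333333]."

print(unsupported_citations(good_summary, context))
print(unsupported_citations(bad_summary, context))
```

A summary for which this function returns a non-empty set could be sent back through the prompt chain for revision or flagged for a curator. Note that this only checks that a cited identifier exists in the context, not that the cited paper supports the claim.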
