Motivation: Curation of literature in the life sciences is a growing challenge. The continued increase in the rate of publication, coupled with the relatively fixed number of curators worldwide, presents a major challenge to developers of biomedical knowledgebases. Very few knowledgebases have the resources to scale to the whole relevant literature, and all have to prioritise their efforts. Results: In this work, we take a first step towards alleviating the lack of curator time in RNA science by generating summaries of literature for non-coding RNAs (ncRNAs) using large language models (LLMs). We demonstrate that high-quality, factually accurate summaries with accurate references can be automatically generated from the literature using a commercial LLM and a chain of prompts and checks. Manual assessment was carried out for a subset of summaries, with the majority being rated as extremely high quality. We also applied the most commonly used automated evaluation approaches, finding that they do not correlate with human assessment. Finally, we apply our tool to a selection of over 4,600 ncRNAs and make the generated summaries available via the RNAcentral resource. We conclude that automated literature summarization is feasible with the current generation of LLMs, provided careful prompting and automated checking are applied. Availability: Code used to produce these summaries can be found at https://github.com/RNAcentral/litscan-summarization, and the dataset of contexts and summaries can be found at https://huggingface.co/datasets/RNAcentral/litsumm-v1. Summaries are also displayed on the RNA report pages in RNAcentral (https://rnacentral.org/).