Motivation: Curation of literature in the life sciences is a growing challenge. The continued increase in the rate of publication, coupled with the relatively fixed number of curators worldwide, presents a major challenge to developers of biomedical knowledgebases. Very few knowledgebases have the resources to scale to the whole relevant literature, and all have to prioritise their efforts.

Results: In this work, we take a first step towards alleviating the lack of curator time in RNA science by generating summaries of the literature for non-coding RNAs (ncRNAs) using large language models (LLMs). We demonstrate that high-quality, factually accurate summaries with accurate references can be automatically generated from the literature using a commercial LLM and a chain of prompts and checks. Manual assessment was carried out for a subset of summaries, with the majority being rated as extremely high quality. We also applied the most commonly used automated evaluation approaches, finding that they do not correlate with human assessment. Finally, we apply our tool to a selection of over 4,600 ncRNAs and make the generated summaries available via the RNAcentral resource. We conclude that automated literature summarization is feasible with the current generation of LLMs, provided that careful prompting and automated checking are applied.

Availability: Code used to produce these summaries can be found here: https://github.com/RNAcentral/litscan-summarization and the dataset of contexts and summaries can be found here: https://huggingface.co/datasets/RNAcentral/litsumm-v1. Summaries are also displayed on the RNA report pages in RNAcentral (https://rnacentral.org/).