Large Language Models (LLMs) generalize well across language tasks, but suffer from hallucinations and a lack of interpretability, making it difficult to assess their accuracy without ground truth. Retrieval-Augmented Generation (RAG) models have been proposed to reduce hallucinations and provide provenance for how an answer was generated. Applying such models to the scientific literature may enable large-scale, systematic processing of scientific knowledge. We present PaperQA, a RAG agent for answering questions over the scientific literature. PaperQA performs information retrieval across full-text scientific articles, assesses the relevance of sources and passages, and uses RAG to provide answers. Viewing this agent as a question answering model, we find that it exceeds the performance of existing LLMs and LLM agents on current science QA benchmarks. To push the field closer to how humans perform research on scientific literature, we also introduce LitQA, a more complex benchmark that requires retrieval and synthesis of information from full-text scientific papers across the literature. Finally, we demonstrate that PaperQA matches expert human researchers on LitQA.
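The three stages named in the abstract (retrieve passages, assess their relevance, then answer from the retained context) can be illustrated with a toy sketch. This is not PaperQA's implementation: the bag-of-words cosine score stands in for whatever retrieval and relevance assessment the agent actually uses, and the function names (`retrieve`, `build_prompt`), the `min_score` threshold, and the example passages are all hypothetical. The final LLM call is left out; the sketch stops at the prompt that would carry the cited context.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two token-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=2, min_score=0.1):
    # Score every (source, text) passage against the question, drop
    # passages below the (hypothetical) relevance threshold, and keep
    # the k highest-scoring ones.
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(text))), source, text)
              for source, text in passages]
    relevant = [item for item in scored if item[0] >= min_score]
    return sorted(relevant, key=lambda item: item[0], reverse=True)[:k]


def build_prompt(question, hits):
    # Tag each passage with its source so an answer can cite provenance.
    context = "\n".join(f"[{source}] {text}" for _, source, text in hits)
    return ("Answer using only the cited context.\n"
            f"{context}\nQuestion: {question}")


# Made-up passages, for illustration only.
passages = [
    ("paper_a", "Retrieval augmented generation reduces hallucinations in language models."),
    ("paper_b", "Protein folding depends on amino acid sequence."),
    ("paper_c", "Language models can hallucinate facts without retrieval."),
]
question = "How does retrieval reduce hallucinations in language models?"
hits = retrieve(question, passages)
prompt = build_prompt(question, hits)
print(prompt)
```

Keeping the source label attached to each passage is what lets a downstream answer point back to where a claim came from, which is the provenance property the abstract attributes to RAG.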