Language models are known to hallucinate incorrect information, and it is unclear whether they are sufficiently accurate and reliable for use in scientific research. We developed a rigorous human-AI comparison methodology to evaluate language model agents on real-world literature search tasks covering information retrieval, summarization, and contradiction detection. We show that PaperQA2, a frontier language model agent optimized for improved factuality, matches or exceeds subject matter expert performance on three realistic literature research tasks, with no restrictions placed on the human experts (i.e., full access to the internet, search tools, and time). PaperQA2 writes cited, Wikipedia-style summaries of scientific topics that are significantly more accurate than existing, human-written Wikipedia articles. We also introduce LitQA2, a hard benchmark for scientific literature research that guided the design of PaperQA2, enabling it to exceed human performance. Finally, we apply PaperQA2 to identify contradictions within the scientific literature, an important scientific task that is challenging for humans. PaperQA2 identifies 2.34 ± 1.99 contradictions per paper in a random subset of biology papers, of which 70% are validated by human experts. These results demonstrate that language model agents are now capable of exceeding domain experts across meaningful tasks on scientific literature.