LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Mohamed Shaaban, Zifan Wang, Seth Donoughe, Julian Michael

From arXiv, 59 pages, 33 figures

Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three of them. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.
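To make the headline figure concrete: "4.16 times more accurate" is a ratio of accuracies between the LLM-access group and the internet-only group, reported with a 95% interval. The abstract does not say how that interval was estimated, so the sketch below is only an illustration of the general idea, using a percentile bootstrap as a stand-in technique. The function names, the toy outcome lists, and the choice of bootstrap are all assumptions and are not taken from the paper.

```python
import random


def accuracy(outcomes):
    """Fraction of correct answers in a list of 0/1 outcomes."""
    return sum(outcomes) / len(outcomes)


def accuracy_ratio_ci(treated, control, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap interval for
    accuracy(treated) / accuracy(control).

    Illustrative only: the paper's actual estimation procedure is not
    described in the abstract and may differ.
    """
    rng = random.Random(seed)
    point = accuracy(treated) / accuracy(control)

    ratios = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping group sizes fixed.
        t = rng.choices(treated, k=len(treated))
        c = rng.choices(control, k=len(control))
        c_acc = accuracy(c)
        if c_acc == 0:
            continue  # ratio undefined for this resample; skip it
        ratios.append(accuracy(t) / c_acc)

    ratios.sort()
    lower = ratios[int((alpha / 2) * len(ratios))]
    upper = ratios[int((1 - alpha / 2) * len(ratios)) - 1]
    return point, lower, upper


# Hypothetical toy data (NOT from the study): 1 = correct, 0 = incorrect.
llm_group = [1] * 40 + [0] * 60      # 40% accuracy
control_group = [1] * 10 + [0] * 90  # 10% accuracy

point, lower, upper = accuracy_ratio_ci(llm_group, control_group)
print(f"accuracy ratio = {point:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

An interval on a ratio like this is typically asymmetric around the point estimate, which matches the shape of the reported [2.63, 6.87] around 4.16.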
