LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Mohamed Shaaban, Zifan Wang, Seth Donoughe, Julian Michael

from arXiv, 59 pages, 33 figures

Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users, i.e., enable humans to perform better than they would with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On three of the four benchmarks with available internet-only expert baselines, novices with LLMs outperformed the experts. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.
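To make the headline statistic concrete, the following is a minimal sketch of how an accuracy ratio and a percentile bootstrap confidence interval could be computed. It is an illustration only: the abstract does not state which statistical model produced the 4.16 estimate, the function names `accuracy` and `ratio_with_bootstrap_ci` are hypothetical, and the per-task scores are toy data, not results from the study.

```python
import random


def accuracy(scores):
    """Fraction of tasks answered correctly (scores are 0 or 1)."""
    return sum(scores) / len(scores)


def ratio_with_bootstrap_ci(treated, control, n_boot=2000, alpha=0.05, seed=0):
    """Accuracy ratio (treated / control) with a percentile bootstrap CI.

    Assumption: each group is resampled independently with replacement.
    This is a generic technique, not necessarily the paper's analysis.
    """
    rng = random.Random(seed)
    point = accuracy(treated) / accuracy(control)
    ratios = []
    for _ in range(n_boot):
        t = rng.choices(treated, k=len(treated))
        c = rng.choices(control, k=len(control))
        if sum(c) == 0:
            continue  # ratio is undefined when the resampled control accuracy is 0
        ratios.append(accuracy(t) / accuracy(c))
    ratios.sort()
    lower = ratios[int((alpha / 2) * len(ratios))]
    upper = ratios[int((1 - alpha / 2) * len(ratios)) - 1]
    return point, lower, upper


# Toy data (hypothetical): 60 task attempts per arm.
treated = [1] * 24 + [0] * 36  # LLM arm, 40% accuracy
control = [1] * 6 + [0] * 54   # internet-only arm, 10% accuracy

point, lower, upper = ratio_with_bootstrap_ci(treated, control)
print(f"ratio = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

A ratio of accuracies is bounded below by zero and unbounded above, so its interval is typically asymmetric around the point estimate, as is the reported [2.63, 6.87] around 4.16.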
