Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology

Shen Zhou Hong, Alex Kleinman, Alyssa Mathiowetz, Adam Howes, Julian Cohen, Suveer Ganta, Alex Letizia, Dora Liao, Deepika Pahari, Xavier Roberts-Gaal, Luca Righetti, Joe Torres

Large language models (LLMs) perform strongly on biological benchmarks, raising concerns that they may help novice actors acquire dual-use laboratory skills. Yet whether this translates to improved human performance in the physical laboratory remains unclear. To address this question, we conducted a pre-registered, investigator-blinded, randomized controlled trial (June–August 2025; n = 153) evaluating whether LLMs improve novice performance in tasks that collectively model a viral reverse genetics workflow. We observed no significant difference in the primary endpoint of workflow completion (5.2% LLM vs. 6.6% Internet; P = 0.759), nor in the success rate of individual tasks. However, the LLM arm had numerically higher success rates in four of the five tasks, most notably for the cell culture task (68.8% LLM vs. 55.3% Internet; P = 0.059). Post-hoc Bayesian modeling of pooled data estimates an approximate 1.4-fold increase (95% credible interval [CrI] 0.74–2.62) in success for a "typical" reverse genetics task under LLM assistance. Ordinal regression modeling suggests that participants in the LLM arm were more likely to progress through intermediate steps across all tasks (posterior probability of a positive effect: 81%–96%). Overall, mid-2025 LLMs did not substantially increase novice completion of complex laboratory procedures but were associated with a modest performance benefit. These results reveal a gap between in silico benchmarks and real-world utility, underscoring the need for physical-world validation of AI biosecurity assessments as model capabilities and user proficiency evolve.
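
The reported percentages can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the abstract does not state per-arm sample sizes or success counts, so the values used here (77 LLM and 76 Internet participants, with the counts shown) are assumptions back-calculated from n = 153 and the rounded percentages. The names `ARM_SIZES`, `SUCCESSES`, `success_rate`, and `risk_ratio` are hypothetical helpers, not part of the study's analysis code. The crude risk ratio it computes is a raw ratio of proportions for a single endpoint and is not the pooled Bayesian estimate quoted above.

```python
from fractions import Fraction

# ASSUMPTION: arm sizes and success counts are NOT given in the abstract.
# They are back-calculated so that n = 153 and the rounded percentages match.
ARM_SIZES = {"LLM": 77, "Internet": 76}
SUCCESSES = {
    "workflow completion": {"LLM": 4, "Internet": 5},
    "cell culture": {"LLM": 53, "Internet": 42},
}


def success_rate(successes: int, n: int) -> float:
    """Return the success rate as a percentage."""
    return 100.0 * successes / n


def risk_ratio(task: str) -> float:
    """Crude (unadjusted) ratio of LLM to Internet success proportions."""
    counts = SUCCESSES[task]
    llm = Fraction(counts["LLM"], ARM_SIZES["LLM"])
    internet = Fraction(counts["Internet"], ARM_SIZES["Internet"])
    return float(llm / internet)


if __name__ == "__main__":
    for task, counts in SUCCESSES.items():
        llm_pct = success_rate(counts["LLM"], ARM_SIZES["LLM"])
        net_pct = success_rate(counts["Internet"], ARM_SIZES["Internet"])
        print(f"{task}: LLM {llm_pct:.1f}% vs. Internet {net_pct:.1f}% "
              f"(crude risk ratio {risk_ratio(task):.2f})")
```

Under these assumed counts, the rounded percentages agree with those reported for the primary endpoint and the cell culture task; the actual denominators may differ if any participants were excluded from a given analysis.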
