Large language models can benefit research and human understanding by providing tutorials that draw on expertise from many different fields. A properly safeguarded model will refuse to provide "dual-use" insights that could be misused to cause severe harm, but some models with publicly released weights have been tuned to remove safeguards within days of introduction. Here we investigated whether continued model weight proliferation is likely to help malicious actors leverage more capable future models to inflict mass death. We organized a hackathon in which participants were instructed to discover how to obtain and release the reconstructed 1918 pandemic influenza virus by entering clearly malicious prompts into parallel instances of the "Base" Llama-2-70B model and a "Spicy" version tuned to remove censorship. The Base model typically rejected malicious prompts, whereas the Spicy model provided some participants with nearly all key information needed to obtain the virus. Our results suggest that releasing the weights of future, more capable foundation models, no matter how robustly safeguarded, will trigger the proliferation of capabilities sufficient to acquire pandemic agents and other biological weapons.