LLMs produce harmful and undesirable behavior when trained on poisoned datasets that contain even a small fraction of corrupted or harmful data. We develop a new attack paradigm, jailbreak-tuning, which combines data poisoning with jailbreaking to fully bypass state-of-the-art safeguards and make models like GPT-4o comply with nearly any harmful request. Our experiments suggest this attack represents a paradigm shift in vulnerability elicitation, producing differences in refusal rates of 60+ percentage points relative to normal fine-tuning. Given this demonstration of how data poisoning vulnerabilities persist and can be amplified, we investigate whether these risks are likely to increase as models scale. We evaluate three threat models: malicious fine-tuning, imperfect data curation, and intentional data contamination, across 23 frontier LLMs ranging from 1.5 to 72 billion parameters. Our experiments reveal that larger LLMs are significantly more susceptible to data poisoning, learning harmful behaviors from even minimal exposure more quickly than smaller models. These findings underscore the need for leading AI companies to thoroughly red-team fine-tuning APIs before public release and to develop more robust safeguards against data poisoning, particularly as models continue to scale in size and capability.