Language models exhibit scaling laws, whereby increasing model and dataset size yields predictable decreases in negative log likelihood, unlocking a dazzling array of capabilities. This phenomenon spurs many companies to train ever larger models in pursuit of ever improved performance. Yet, these models are vulnerable to adversarial inputs such as ``jailbreaks'' and prompt injections that induce models to perform undesired behaviors, posing a growing risk as models become more capable. Prior work indicates that computer vision models become more robust with model and data scaling, raising the question: does language model robustness also improve with scale? We study this question empirically in the classification setting, finding that without explicit defense training, larger models tend to be modestly more robust on most tasks, though the effect is not reliable. Even with the advantage conferred by scale, undefended models remain easy to attack in absolute terms, so we turn our attention to explicitly training models for adversarial robustness, which we show to be a far more compute-efficient defense than scaling model size alone. In this setting, we also observe that larger adversarially trained models generalize faster and better than smaller models to modified attacks not seen during training. Finally, we analyze the offense/defense balance of increasing compute, finding parity in some settings and an advantage for offense in others, suggesting that adversarial training alone is not sufficient to solve robustness, even at greater model scales.