Language model capabilities predictably improve from scaling a model's size and training data. Motivated by this, increasingly large language models have been trained, yielding an array of impressive capabilities. Yet these models are vulnerable to adversarial prompts, such as "jailbreaks" that hijack models to perform undesired behaviors, posing a significant risk of misuse. Prior work indicates that computer vision models become more robust with model and data scaling, raising the question: does language model robustness also improve with scale? We study this question empirically, finding that larger models respond substantially better to adversarial training, but there is little to no benefit from model scale in the absence of explicit defenses.