Language models exhibit scaling laws, whereby increasing model and dataset size yields predictable decreases in negative log likelihood, unlocking a dazzling array of capabilities. This phenomenon spurs many companies to train ever larger models in pursuit of ever improved performance. Yet, these models are vulnerable to adversarial inputs such as ``jailbreaks'' and prompt injections that induce models to perform undesired behaviors, posing a growing risk as models become more capable. Prior work indicates that computer vision models become more robust with model and data scaling, raising the question: does language model robustness also improve with scale? We study this question empirically in the classification setting, finding that without explicit defense training, larger models tend to be modestly more robust on most tasks, though the effect is not reliable. Even with the advantage conferred by scale, undefended models remain easy to attack in absolute terms, so we turn our attention to explicitly training models for adversarial robustness, which we show to be a far more compute-efficient defense than scaling model size alone. In this setting, we also observe that larger adversarially trained models generalize faster and better than smaller models to modified attacks not seen during training. Finally, we analyze the offense/defense balance of increasing compute, finding parity in some settings and an advantage for offense in others, suggesting that adversarial training alone is not sufficient to solve robustness, even at greater model scales.