In recent years, NLP practitioners have converged on the following practice: (i) import an off-the-shelf pretrained (masked) language model; (ii) append a multilayer perceptron (MLP) with randomly initialized weights atop the CLS token's hidden representation; and (iii) fine-tune the entire model on a downstream task (MLP-FT). This procedure has produced massive gains on standard NLP benchmarks, but the resulting models remain brittle, even to mild adversarial perturbations. In this work, we demonstrate surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP), an alternative method of adapting to downstream tasks. Rather than appending an MLP head to make output predictions, MVP appends a prompt template to the input and makes predictions via text infilling/completion. Across 5 NLP datasets, 4 adversarial attacks, and 3 different models, MVP improves performance against adversarial substitutions by an average of 8% over standard methods and even outperforms adversarial training-based state-of-the-art defenses by 3.5%. By combining MVP with adversarial training, we achieve further improvements in adversarial robustness while maintaining performance on unperturbed examples. Finally, we conduct ablations to investigate the mechanism underlying these gains. Notably, we find that the vulnerability of MLP-FT can be attributed mainly to two causes: the misalignment between pre-training and fine-tuning tasks, and the randomly initialized MLP parameters.
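The infilling-based prediction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the template, the candidate answer words, and the helper names (`build_prompt`, `class_scores`, `predict`) are all hypothetical, and the mask-position logits are hand-written toy values standing in for the output of a pretrained masked language model's own vocabulary head. The point is only that no new randomly initialized classifier is needed: the class is read off from the scores the language model already assigns to answer words at the masked position.

```python
import math
from typing import Dict, List, Tuple

# Assumption: a hypothetical sentiment template and answer words.
# The abstract does not specify either; they are chosen for illustration.
TEMPLATE = "{text} This review is [MASK]."
CANDIDATES: Dict[str, List[str]] = {
    "positive": ["good", "great"],
    "negative": ["bad", "terrible"],
}


def build_prompt(text: str) -> str:
    """Append the prompt template to the raw input text."""
    return TEMPLATE.format(text=text)


def class_scores(mask_logits: Dict[str, float]) -> Dict[str, float]:
    """Average the mask-position logits over each class's answer words."""
    return {
        label: sum(mask_logits[word] for word in words) / len(words)
        for label, words in CANDIDATES.items()
    }


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a dict of class scores."""
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def predict(mask_logits: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Return the most probable class and the class distribution."""
    probs = softmax(class_scores(mask_logits))
    return max(probs, key=probs.get), probs


# Toy logits standing in for a masked language model's output at [MASK].
toy_logits = {"good": 2.0, "great": 1.0, "bad": -1.0, "terrible": -2.0}
prompt = build_prompt("A moving, well-acted film.")
label, probs = predict(toy_logits)
```

In a real setup, `toy_logits` would be replaced by the logits that the pretrained model produces at the `[MASK]` position of `prompt`, and fine-tuning would update the model through a loss on the class distribution returned by `predict`. Averaging logits over several answer words per class is one possible design choice assumed here, not something the abstract states.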