Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. In many domains, adversarial training has proven to be one of the most promising methods for reliably improving robustness against such attacks. Yet, in the context of LLMs, current adversarial training methods are hindered by the high computational cost of performing discrete adversarial attacks at each training iteration. We address this problem by instead computing adversarial attacks in the continuous embedding space of the LLM, which is orders of magnitude more efficient. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses: the first makes the model robust to continuous embedding attacks computed on an adversarial behaviour dataset; the second ensures the usefulness of the final model by fine-tuning on utility data. Moreover, we introduce C-AdvIPO, an adversarial variant of IPO that does not require utility data for adversarially robust alignment. Our empirical evaluation on four models from different families (Gemma, Phi3, Mistral, Zephyr) and at different scales (2B, 3.8B, 7B) shows that both algorithms substantially enhance LLM robustness against discrete attacks (GCG, AutoDAN, PAIR) while maintaining utility. Our results demonstrate that robustness to continuous perturbations can extrapolate to discrete threat models, presenting a path toward scalable adversarial training algorithms for robustly aligning LLMs.
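For intuition, here is a minimal PyTorch sketch of a continuous embedding-space attack of the kind the abstract describes, assuming a Hugging Face-style causal LM. The PGD-style update, the step size `alpha`, the radius `eps`, and the L-infinity projection are illustrative assumptions, not necessarily the paper's exact attack formulation.

```python
import torch

def embedding_attack(model, prompt_ids, target_ids, eps=0.1, alpha=0.01, steps=10):
    """PGD-style attack on the prompt's token embeddings that increases
    the likelihood of a given (harmful) target continuation.
    Hyperparameters and the L-inf projection are illustrative choices."""
    embed = model.get_input_embeddings()
    prompt_embeds = embed(prompt_ids).detach()
    target_embeds = embed(target_ids).detach()
    # Supervise only the target positions; -100 masks the prompt in the loss.
    labels = torch.cat([torch.full_like(prompt_ids, -100), target_ids], dim=1)
    delta = torch.zeros_like(prompt_embeds, requires_grad=True)

    for _ in range(steps):
        inputs_embeds = torch.cat([prompt_embeds + delta, target_embeds], dim=1)
        loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # descend on the target loss
            delta.clamp_(-eps, eps)             # project back into the L-inf ball
        delta.grad = None
    return (prompt_embeds + delta).detach()
```

Because each attack step is a single backward pass through continuous inputs, rather than a search over discrete token substitutions as in GCG, the inner loop stays cheap enough to run at every training iteration.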
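The two-loss structure of C-AdvUL could then combine as in the hypothetical training step below, reusing the `embedding_attack` helper above. The batch field names, the refusal-target formulation of the robustness loss, and the weighting `lam` are assumptions for illustration, not the authors' exact objective.

```python
def c_advul_step(model, harmful_batch, utility_batch, optimizer, lam=1.0):
    """One sketched C-AdvUL step: robustness loss on attacked harmful
    prompts plus a plain fine-tuning loss on utility data."""
    embed = model.get_input_embeddings()

    # Loss 1 (robustness): under an embedding-space attack on the harmful
    # prompt, the model should still produce the safe refusal.
    adv_prompt = embedding_attack(model,
                                  harmful_batch["prompt_ids"],
                                  harmful_batch["harmful_target_ids"])
    refusal_embeds = embed(harmful_batch["refusal_ids"])
    inputs_embeds = torch.cat([adv_prompt, refusal_embeds], dim=1)
    labels = torch.cat([torch.full_like(harmful_batch["prompt_ids"], -100),
                        harmful_batch["refusal_ids"]], dim=1)
    robust_loss = model(inputs_embeds=inputs_embeds, labels=labels).loss

    # Loss 2 (utility): ordinary fine-tuning on benign instruction data.
    utility_loss = model(input_ids=utility_batch["input_ids"],
                         labels=utility_batch["labels"]).loss

    (robust_loss + lam * utility_loss).backward()
    optimizer.step()
    optimizer.zero_grad()
```

C-AdvIPO, by contrast, folds the robustness objective into an IPO-style preference loss over safe versus unsafe completions, removing the need for the separate utility term.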