Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. In many domains, adversarial training has proven to be one of the most promising methods for reliably improving robustness against such attacks. Yet, in the context of LLMs, current adversarial training methods are hindered by the high computational cost of performing discrete adversarial attacks at each training iteration. We address this problem by instead computing adversarial attacks in the continuous embedding space of the LLM, which is orders of magnitude more efficient. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses: the first makes the model robust to continuous embedding attacks computed on an adversarial behaviour dataset; the second ensures the usefulness of the final model by fine-tuning on utility data. Moreover, we introduce C-AdvIPO, an adversarial variant of IPO that does not require utility data for adversarially robust alignment. Our empirical evaluation on four models from different families (Gemma, Phi3, Mistral, Zephyr) and at different scales (2B, 3.8B, 7B) shows that both algorithms substantially enhance LLM robustness against discrete attacks (GCG, AutoDAN, PAIR) while maintaining utility. Our results demonstrate that robustness to continuous perturbations can extrapolate to discrete threat models, presenting a path toward scalable adversarial training algorithms for robustly aligning LLMs.
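For intuition, here is a minimal PyTorch sketch of a continuous embedding-space attack of the kind the abstract describes, assuming a Hugging Face-style causal LM. The PGD-style update, the step size `alpha`, the radius `eps`, and the L-infinity projection are illustrative assumptions, not necessarily the paper's exact attack formulation.

```python
import torch

def embedding_attack(model, prompt_ids, target_ids, eps=0.1, alpha=0.01, steps=10):
    """PGD-style attack on the prompt's token embeddings that increases
    the likelihood of a given (harmful) target continuation.
    Hyperparameters and the L-inf projection are illustrative choices."""
    embed = model.get_input_embeddings()
    prompt_embeds = embed(prompt_ids).detach()
    target_embeds = embed(target_ids).detach()
    # Supervise only the target positions; -100 masks the prompt in the loss.
    labels = torch.cat([torch.full_like(prompt_ids, -100), target_ids], dim=1)
    delta = torch.zeros_like(prompt_embeds, requires_grad=True)

    for _ in range(steps):
        inputs_embeds = torch.cat([prompt_embeds + delta, target_embeds], dim=1)
        loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # descend on the target loss
            delta.clamp_(-eps, eps)             # project back into the L-inf ball
        delta.grad = None
    return (prompt_embeds + delta).detach()
```

Because each attack step is a single backward pass through continuous inputs, rather than a search over discrete token substitutions as in GCG, the inner loop stays cheap enough to run at every training iteration.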
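The two-loss structure of C-AdvUL could then combine as in the hypothetical training step below, reusing the `embedding_attack` helper above. The batch field names, the refusal-target formulation of the robustness loss, and the weighting `lam` are assumptions for illustration, not the authors' exact objective.

```python
def c_advul_step(model, harmful_batch, utility_batch, optimizer, lam=1.0):
    """One sketched C-AdvUL step: robustness loss on attacked harmful
    prompts plus a plain fine-tuning loss on utility data."""
    embed = model.get_input_embeddings()

    # Loss 1 (robustness): under an embedding-space attack on the harmful
    # prompt, the model should still produce the safe refusal.
    adv_prompt = embedding_attack(model,
                                  harmful_batch["prompt_ids"],
                                  harmful_batch["harmful_target_ids"])
    refusal_embeds = embed(harmful_batch["refusal_ids"])
    inputs_embeds = torch.cat([adv_prompt, refusal_embeds], dim=1)
    labels = torch.cat([torch.full_like(harmful_batch["prompt_ids"], -100),
                        harmful_batch["refusal_ids"]], dim=1)
    robust_loss = model(inputs_embeds=inputs_embeds, labels=labels).loss

    # Loss 2 (utility): ordinary fine-tuning on benign instruction data.
    utility_loss = model(input_ids=utility_batch["input_ids"],
                         labels=utility_batch["labels"]).loss

    (robust_loss + lam * utility_loss).backward()
    optimizer.step()
    optimizer.zero_grad()
```

C-AdvIPO, by contrast, folds the robustness objective into an IPO-style preference loss over safe versus unsafe completions, removing the need for the separate utility term.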