Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of 'jailbreaking' techniques to elicit harmful text from models that were fine-tuned to be harmless. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities from LLMs. Prior work has introduced latent adversarial training (LAT) as a way to improve robustness to broad classes of failures. These prior works have considered untargeted latent space attacks where the adversary perturbs latent activations to maximize loss on examples of desirable behavior. Untargeted LAT can provide a generic type of robustness but does not leverage information about specific failure modes. Here, we experiment with targeted LAT where the adversary seeks to minimize loss on a specific competing task. We find that it can augment a wide variety of state-of-the-art methods. First, we use targeted LAT to improve robustness to jailbreaks, outperforming a strong R2D2 baseline with orders of magnitude less compute. Second, we use it to more effectively remove backdoors with no knowledge of the trigger. Finally, we use it to more effectively unlearn knowledge for specific undesirable tasks in a way that is also more robust to re-learning. Overall, our results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.
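To make the untargeted/targeted distinction concrete, below is a minimal sketch of a targeted LAT training loop, assuming a toy two-stage model (`lower`/`upper` standing in for the early and late layers of an LLM) and an L-infinity bound on the latent perturbation. The module names, data, and hyperparameters are illustrative stand-ins, not the paper's exact configuration: the point is only that the inner adversary minimizes loss on a competing (undesired) target in activation space, while the outer update trains the model to still produce the desired behavior under that perturbation.

```python
# Hedged sketch of targeted latent adversarial training (LAT).
# Assumptions: a toy model split into `lower` (early layers) and `upper` (late layers),
# synthetic data, and an L-inf-bounded perturbation; these are illustrative choices.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

hidden = 32
lower = nn.Sequential(nn.Linear(16, hidden), nn.ReLU())  # "early layers"
upper = nn.Linear(hidden, 4)                              # "late layers" -> logits
opt = torch.optim.Adam(list(lower.parameters()) + list(upper.parameters()), lr=1e-3)

x = torch.randn(8, 16)                   # inputs (e.g. harmful prompts)
y_desired = torch.randint(0, 4, (8,))    # desired behavior (e.g. refusal)
y_competing = torch.randint(0, 4, (8,))  # competing, undesired behavior

epsilon, attack_lr, attack_steps = 1.0, 0.1, 10

for step in range(100):
    h = lower(x)  # latent activations to be attacked

    # Inner loop (targeted latent-space attack): optimize a perturbation delta
    # to MINIMIZE loss on the competing task, steering the model toward the
    # undesired behavior.
    delta = torch.zeros_like(h, requires_grad=True)
    for _ in range(attack_steps):
        attack_loss = F.cross_entropy(upper(h.detach() + delta), y_competing)
        (grad,) = torch.autograd.grad(attack_loss, delta)
        with torch.no_grad():
            delta -= attack_lr * grad
            delta.clamp_(-epsilon, epsilon)  # keep the perturbation bounded

    # Outer loop (defender update): train the model to produce the desired
    # behavior even under the adversarial latent perturbation.
    logits_attacked = upper(h + delta.detach())
    defence_loss = F.cross_entropy(logits_attacked, y_desired)
    opt.zero_grad()
    defence_loss.backward()
    opt.step()
```

For contrast, the untargeted variant described above would instead have the inner loop maximize loss on examples of desirable behavior, without referencing any specific competing task.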