Adversarial training is one of the most promising methods for reliably improving the robustness of LLMs against adversarial attacks. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. We argue that this persistent fragility stems from a fundamental limitation of current adversarial training algorithms: they minimize adversarial loss on their training set but inadequately cover the data distribution, leaving models vulnerable to seemingly simple attacks. To bridge this gap, we propose Distributional Adversarial Training (DAT). We leverage diffusion LLMs to approximate the true joint distribution of prompts and responses, enabling the generation of diverse, high-likelihood samples that address these generalization failures. By combining optimization over the data distribution provided by the diffusion model with continuous adversarial training, DAT achieves substantially higher adversarial robustness than previous methods.
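The abstract combines two ingredients: sampling training prompts from a generative model that covers the data distribution, and attacking the target model with continuous perturbations in embedding space. The following is a minimal conceptual sketch of such a loop, not the paper's actual DAT algorithm: a toy network stands in for the LLM, `sample_prompts` is a hypothetical stub for drawing samples from a diffusion LLM, and the inner attack is a standard PGD-style perturbation on token embeddings.

```python
# Hypothetical sketch of a distribution-sampled continuous adversarial
# training loop. All component names here are illustrative assumptions;
# the paper's concrete DAT procedure is not specified in this abstract.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM = 32, 16

model = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, VOCAB))
embed = nn.Embedding(VOCAB, DIM)
opt = torch.optim.SGD(list(model.parameters()) + list(embed.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def sample_prompts(batch, length):
    # Stand-in for sampling diverse, high-likelihood prompts from a
    # diffusion LLM; here it just draws random token ids.
    return torch.randint(0, VOCAB, (batch, length))

def embedding_attack(x_emb, target, eps=0.1, steps=3):
    # Continuous adversarial perturbation in embedding space (PGD-style):
    # ascend the loss w.r.t. an additive delta, projected to an L-inf ball.
    delta = torch.zeros_like(x_emb, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x_emb + delta).flatten(0, 1), target.flatten())
        loss.backward()
        with torch.no_grad():
            delta += eps * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return delta.detach()

for step in range(20):
    tokens = sample_prompts(batch=8, length=5)
    target = tokens  # toy objective: reconstruct the sampled tokens
    x_emb = embed(tokens)
    delta = embedding_attack(x_emb.detach(), target)
    opt.zero_grad()
    adv_loss = loss_fn(model(x_emb + delta).flatten(0, 1), target.flatten())
    adv_loss.backward()
    opt.step()
```

The key design point mirrored from the abstract is that prompts come from a sampler approximating the data distribution rather than from a fixed training set, while robustness is enforced by minimizing loss under the worst-case embedding perturbation.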