Defending LLMs against adversarial jailbreak attacks remains an open challenge. Existing defenses rely on binary classifiers that fail when adversarial inputs fall outside the learned decision boundary, and repeated fine-tuning is computationally expensive and can degrade model capabilities. We propose MANATEE, an inference-time defense based on density estimation over a benign representation manifold. MANATEE learns the score function of benign hidden states and uses diffusion to project anomalous representations toward safe regions, requiring no harmful training data and no architectural modifications. Experiments across Mistral-7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-it demonstrate that MANATEE reduces Attack Success Rate by up to 100\% on certain datasets while preserving model utility on benign inputs.
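To make the projection step concrete, the following is a minimal sketch assuming the diffusion step follows standard Langevin dynamics with a learned score network; the symbols $s_\theta$, $\epsilon$, and $K$ are illustrative notation, not taken from the paper. Given a score network $s_\theta(h) \approx \nabla_h \log p_{\text{benign}}(h)$ trained on benign hidden states $h$, an anomalous representation $h_0$ can be moved toward high-density benign regions by iterating
\[
h_{k+1} = h_k + \frac{\epsilon}{2}\, s_\theta(h_k) + \sqrt{\epsilon}\, z_k, \qquad z_k \sim \mathcal{N}(0, I), \quad k = 0, \dots, K-1,
\]
where $\epsilon$ is a step size and $K$ the number of steps; the noise term $z_k$ prevents collapse to a single mode, and under this sketch the projected state $h_K$ would replace the anomalous hidden state at inference time.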