Defense in large language models (LLMs) is crucial to counter the numerous attackers exploiting these systems to generate harmful content through manipulated prompts, known as jailbreak attacks. Although many defense strategies have been proposed, they often require access to the model's internal structure or need additional training, which is impractical for service providers using LLM APIs, such as OpenAI APIs or Claude APIs. In this paper, we propose a moving target defense approach that alters decoding hyperparameters to enhance model robustness against various jailbreak attacks. Our approach does not require access to the model's internal structure and incurs no additional training costs. The proposed defense includes two key components: (1) optimizing the decoding strategy by identifying and adjusting decoding hyperparameters that influence token generation probabilities, and (2) transforming the decoding hyperparameters and model system prompts into dynamic targets, which are continuously altered during each runtime. By continuously modifying decoding strategies and prompts, the defense effectively mitigates the existing attacks. Our results demonstrate that our defense is the most effective against jailbreak attacks in three of the models tested when using LLMs as black-box APIs. Moreover, our defense offers lower inference costs and maintains comparable response quality, making it a potential layer of protection when used alongside other defense methods.
翻译:在大语言模型(LLM)中实施防御至关重要,以应对众多攻击者通过操纵提示(即越狱攻击)利用这些系统生成有害内容的问题。尽管已有多种防御策略被提出,但它们通常需要访问模型的内部结构或进行额外训练,这对于使用LLM API(如OpenAI API或Claude API)的服务提供商而言并不现实。本文提出一种移动目标防御方法,通过调整解码超参数来增强模型对各种越狱攻击的鲁棒性。该方法无需访问模型内部结构,且不产生额外训练成本。所提出的防御包含两个关键组成部分:(1)通过识别并调整影响令牌生成概率的解码超参数来优化解码策略;(2)将解码超参数与模型系统提示转化为动态目标,在每次运行时持续变化。通过持续修改解码策略与提示,该防御能有效缓解现有攻击。实验结果表明,在将LLM作为黑盒API使用的场景下,我们的防御在三个测试模型中均表现出最优的越狱攻击抵御效果。此外,该防御具有更低的推理成本,并能保持相当的响应质量,使其在与其他防御方法协同使用时具备成为潜在保护层的能力。