As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensure that their responses comply with safety standards. Prior research has revealed that LLMs often fail to grasp the notion of safe behavior, resulting in either unjustified refusals of harmless prompts or the generation of harmful content. While substantial efforts have been made to improve their robustness, existing defenses often rely on costly fine-tuning of model parameters or employ suboptimal heuristic techniques. In this work, we take a novel approach to safeguarding LLMs by learning to adapt the system prompts in instruction-tuned LLMs. While LLMs are typically pre-trained to follow a fixed system prompt, we investigate how tailoring the system prompt to each specific user input affects the safety of the responses. To this end, we propose $\textbf{Sysformer}$, a trans$\textbf{former}$ model that updates an initial $\textbf{sys}$tem prompt to a more robust system prompt in the LLM input embedding space while attending to the user prompt. Keeping the LLM parameters frozen, Sysformer is trained to refuse a set of harmful prompts while responding appropriately to a set of safe ones. Through extensive experiments on $5$ LLMs from different families and $2$ recent benchmarks, we demonstrate that Sysformer can significantly enhance the robustness of LLMs, yielding up to an $80\%$ gain in the refusal rate on harmful prompts while improving compliance with safe prompts by up to $90\%$. These results also generalize well to sophisticated jailbreaking attacks, making LLMs up to $100\%$ more robust against different attack strategies. We hope our findings lead to cheaper safeguarding of LLMs and motivate future investigations into designing variable system prompts.
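To make the core mechanism concrete, the following is a minimal sketch of an input-conditioned system-prompt update: the system-prompt token embeddings act as queries that cross-attend to the user-prompt embeddings, producing an updated system prompt of the same shape in embedding space, with the base LLM's parameters untouched. This is an illustrative toy in numpy, not the paper's actual architecture; all function and variable names (`sysformer_update`, the projection matrices `Wq`, `Wk`, `Wv`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sysformer_update(sys_emb, user_emb, Wq, Wk, Wv):
    """Single cross-attention step (illustrative, not the paper's design):
    system-prompt tokens (queries) attend to user-prompt tokens
    (keys/values); a residual connection keeps the result anchored to
    the initial system prompt. Output shape matches sys_emb, so it can
    replace the system-prompt embeddings fed to a frozen LLM."""
    Q = sys_emb @ Wq                      # (S, d)
    K = user_emb @ Wk                     # (U, d)
    V = user_emb @ Wv                     # (U, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (S, U)
    return sys_emb + attn @ V             # updated system prompt, (S, d)

rng = np.random.default_rng(0)
d = 16
sys_emb = rng.normal(size=(8, d))    # 8 system-prompt token embeddings
user_emb = rng.normal(size=(12, d))  # 12 user-prompt token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
new_sys = sysformer_update(sys_emb, user_emb, Wq, Wk, Wv)
print(new_sys.shape)  # same shape as the initial system prompt
```

In training, the projection weights would be optimized (e.g. with a refusal objective on harmful prompts and a compliance objective on safe ones) while the LLM itself stays frozen; only the replaced system-prompt embeddings change per user input.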