As the popularity of Large Language Models (LLMs) grows, combining model safety with utility becomes increasingly important. The challenge is ensuring that LLMs can recognize and decline dangerous prompts without sacrificing their ability to be helpful. The problem of "exaggerated safety" demonstrates how difficult this can be: 26.1% of safe prompts were found to be misclassified as dangerous and refused. To reduce these exaggerated safety behaviors, we combine XSTest dataset prompts with interactive, contextual, and few-shot prompting to examine the decision boundaries of LLMs such as Llama2, Gemma, Command R+, and Phi-3. We find that few-shot prompting works best for Llama2, interactive prompting works best for Gemma, and contextual prompting works best for Command R+ and Phi-3. Using a combination of these prompting strategies, we reduce exaggerated safety behaviors by 92.9% overall across all LLMs. Our work presents multiple prompting strategies to jailbreak LLMs' decision-making processes, allowing them to walk the fine line between refusing unsafe prompts and remaining helpful.