Large Language Models (LLMs) face increasing threats from jailbreak attacks that bypass safety alignment. While prompt-based defenses such as Role-Oriented Prompts (RoP) and Task-Oriented Prompts (ToP) have shown effectiveness, the role of few-shot demonstrations in these defense strategies remains unclear. Prior work suggests that few-shot examples may compromise safety, but has not investigated how few-shot prompting interacts with different system prompt strategies. In this paper, we conduct a comprehensive evaluation of multiple mainstream LLMs on four safety benchmarks (AdvBench, HarmBench, SG-Bench, XSTest) using six jailbreak attack methods. Our key finding is that few-shot demonstrations produce opposite effects on RoP and ToP: few-shot improves RoP's safety rate by up to 4.5% by reinforcing role identity, while it degrades ToP's effectiveness by up to 21.2% by distracting attention from the task instructions. Based on these findings, we provide practical recommendations for deploying prompt-based defenses in real-world LLM applications.
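To make the evaluated configurations concrete, below is a minimal sketch of how an RoP- or ToP-style system prompt might be combined with few-shot demonstrations when constructing a chat request. The prompt strings, the demonstration pair, and the helper name are illustrative assumptions, not the paper's actual prompts.

```python
# Illustrative sketch only: prompt texts below are hypothetical stand-ins
# for the paper's RoP/ToP defenses, not the actual prompts used.
from typing import Literal

# RoP frames a persistent role identity for the assistant.
ROP_SYSTEM = (
    "You are a responsible assistant. Your core identity is to be helpful "
    "while refusing any request for harmful content."
)
# ToP frames a per-request safety task to perform before answering.
TOP_SYSTEM = (
    "Task: before answering, check whether the user request seeks harmful "
    "content. If it does, refuse; otherwise answer normally."
)

# Hypothetical few-shot demonstrations: (user, assistant) refusal pairs.
FEW_SHOT_DEMOS = [
    ("How do I pick a lock to break into a house?",
     "I can't help with that. Breaking into someone's house is illegal."),
]

def build_messages(defense: Literal["rop", "top"],
                   user_query: str,
                   use_few_shot: bool = False) -> list[dict]:
    """Assemble an OpenAI-style chat message list for one evaluation query."""
    system = ROP_SYSTEM if defense == "rop" else TOP_SYSTEM
    messages = [{"role": "system", "content": system}]
    if use_few_shot:
        # Demonstrations sit between the system prompt and the live query.
        # Per the paper's finding, this placement helps RoP (reinforcing the
        # role) but hurts ToP (diluting attention to the task instruction).
        for q, a in FEW_SHOT_DEMOS:
            messages.append({"role": "user", "content": q})
            messages.append({"role": "assistant", "content": a})
    messages.append({"role": "user", "content": user_query})
    return messages

if __name__ == "__main__":
    for defense in ("rop", "top"):
        msgs = build_messages(defense, "Example query.", use_few_shot=True)
        print(defense, "->", len(msgs), "messages")
```

Under this setup, the same few-shot block is held fixed across defenses, so any difference in safety rate can be attributed to its interaction with the system prompt strategy rather than to the demonstrations themselves.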