Large Language Models (LLMs) have shown remarkable success in various tasks, but concerns about their safety and the potential for generating harmful content have emerged. In this paper, we delve into the potential of In-Context Learning (ICL) to modulate the alignment of LLMs. Specifically, we propose the In-Context Attack (ICA), which employs strategically crafted harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD), which bolsters model resilience through examples that demonstrate refusal to produce harmful responses. Through extensive experiments, we demonstrate the efficacy of ICA and ICD in respectively elevating and mitigating the success rates of jailbreaking prompts. Moreover, we offer theoretical insights into the mechanism by which a limited set of in-context demonstrations can pivotally influence the safety alignment of LLMs. Our findings illuminate the profound influence of ICL on LLM behavior, opening new avenues for improving the safety and alignment of LLMs.
翻译:大型语言模型(LLMs)在各类任务中展现出显著成功,但其安全性及生成有害内容的潜在风险引发关注。本文深入探究上下文学习(ICL)对LLM对齐行为的调控潜力。具体而言,我们提出上下文攻击(ICA),通过策略性构建有害示例颠覆LLM,同时提出上下文防御(ICD),借助展现拒绝生成有害响应的示例增强模型鲁棒性。通过大量实验,我们证实了ICA与ICD分别能有效提升与降低越狱提示的成功率。此外,我们从理论层面阐释了有限上下文示例如何关键性地影响LLM安全对齐的机制。研究结果揭示了ICL对LLM行为的深远影响,为改进LLM安全性与对齐性开辟了新路径。