Large Language Models (LLMs) have shown remarkable success in various tasks, but concerns about their safety and their potential to generate malicious content have emerged. In this paper, we explore the power of In-Context Learning (ICL) to manipulate the alignment ability of LLMs. We find that, without any fine-tuning, providing just a few in-context demonstrations can increase or decrease the probability that an LLM is jailbroken, i.e., that it answers malicious prompts. Based on these observations, we propose the In-Context Attack (ICA) and In-Context Defense (ICD) methods for jailbreaking and guarding aligned language models, respectively. ICA crafts malicious contexts that guide models toward generating harmful outputs, while ICD enhances model robustness through demonstrations of refusing to answer harmful prompts. Our experiments show that ICA increases, and ICD reduces, the success rate of adversarial jailbreaking attacks. Overall, we shed light on the potential of ICL to influence LLM behavior and provide a new perspective for enhancing the safety and alignment of LLMs.