Multilingual Jailbreak Challenges in Large Language Models

While large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, they also raise safety concerns, such as the "jailbreak" problem, in which malicious instructions manipulate an LLM into exhibiting undesirable behavior. Several preventive measures have been developed to mitigate these risks, but they have focused primarily on English data. In this study, we reveal the presence of multilingual jailbreak challenges in LLMs and consider two risk scenarios: unintentional and intentional. In the unintentional scenario, users query LLMs with non-English prompts and inadvertently bypass the safety mechanisms. In the intentional scenario, malicious users combine malicious instructions with multilingual prompts to attack LLMs deliberately. The experimental results show that in the unintentional scenario, the rate of unsafe content increases as the resource availability of a language decreases. Specifically, for both ChatGPT and GPT-4, prompts in low-resource languages are about three times as likely to elicit harmful content as prompts in high-resource languages. In the intentional scenario, multilingual prompts exacerbate the negative impact of malicious instructions, yielding strikingly high rates of unsafe output: 80.92% for ChatGPT and 40.71% for GPT-4. To address this challenge in the multilingual context, we propose a novel Self-Defense framework that automatically generates multilingual training data for safety fine-tuning. Experimental results show that ChatGPT fine-tuned with such data achieves a substantial reduction in unsafe content generation. Data is available at https://github.com/DAMO-NLP-SG/multilingual-safety-for-LLMs. Warning: This paper contains examples with potentially harmful content.
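The unsafe-content rates quoted in the abstract are percentages of model outputs judged unsafe. As a minimal sketch of how such a rate could be tallied per language group, the snippet below assumes each output has already been labeled safe or unsafe. The function name `unsafe_rate_by_group`, the group labels, and the toy records are hypothetical illustrations, not the paper's evaluation code or data.

```python
from collections import defaultdict


def unsafe_rate_by_group(records):
    """Return {group: percentage of outputs labeled unsafe}.

    `records` is an iterable of (group, is_unsafe) pairs, where `group`
    is a language-resource label and `is_unsafe` is a boolean judgment.
    """
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for group, is_unsafe in records:
        totals[group] += 1
        if is_unsafe:
            unsafe[group] += 1
    return {group: 100.0 * unsafe[group] / totals[group] for group in totals}


# Toy labels invented for illustration only; these are NOT results from the paper.
toy_records = [
    ("high-resource", False),
    ("high-resource", False),
    ("high-resource", False),
    ("high-resource", True),
    ("low-resource", True),
    ("low-resource", False),
    ("low-resource", True),
    ("low-resource", True),
]

rates = unsafe_rate_by_group(toy_records)
ratio = rates["low-resource"] / rates["high-resource"]
```

With these made-up labels, `ratio` compares the low-resource rate with the high-resource rate, which is the kind of comparison behind the "three times" statement in the abstract.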
