The rapid advancement of Large Language Models (LLMs) has brought remarkable capabilities in natural language processing but has also raised concerns about their potential misuse. While strategies such as supervised fine-tuning and reinforcement learning from human feedback have improved their safety, these methods focus primarily on natural language, and the resulting safety behavior may not generalize to other domains. This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs, presenting a novel environment for testing the safety generalization of LLMs. Our comprehensive studies on state-of-the-art LLMs, including GPT-4, Claude-2, and the Llama-2 series, reveal a common safety vulnerability of these models against code input: CodeAttack consistently bypasses the safety guardrails of all models more than 80% of the time. Furthermore, we find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization; the gap widens, for example, when natural language input is encoded with data structures or when less popular programming languages are used. These findings highlight new safety risks in the code domain and the need for more robust safety alignment algorithms to match the code capabilities of LLMs.