Large Language Models (LLMs) have become prevalent across diverse sectors, transforming human life with their extraordinary reasoning and comprehension abilities. As they are increasingly used in sensitive tasks, safety concerns have gained widespread attention, and extensive efforts have been dedicated to aligning LLMs with human moral principles to ensure their safe deployment. Despite these efforts, recent research indicates that aligned LLMs remain prone to specialized jailbreaking prompts that bypass safety measures to elicit violent and harmful content. The intrinsically discrete nature and substantial scale of contemporary LLMs make it persistently challenging to automatically generate diverse, efficient, and potent jailbreaking prompts. In this paper, we introduce RIPPLE (Rapid Optimization via Subconscious Exploitation and Echopraxia), a novel optimization-based method inspired by two psychological concepts: subconsciousness, which describes the processes of the mind that occur without conscious awareness, and echopraxia, the involuntary mimicry of actions. Evaluations across 6 open-source LLMs and 4 commercial LLM APIs show that RIPPLE achieves an average Attack Success Rate of 91.5\%, outperforming five current methods by up to 47.0\% with an 8x reduction in overhead. Furthermore, it displays significant transferability and stealth, successfully evading established detection mechanisms. The code of our work is available at \url{https://github.com/SolidShen/RIPPLE_official/tree/official}.