Large language models (LLMs) are increasingly integrated into mainstream development platforms and everyday technical workflows, typically behind moderation and safety controls. Despite these controls, preventing prompt-based policy evasion remains challenging: adversaries continue to jailbreak LLMs with prompts crafted to circumvent deployed safety mechanisms. While prior jailbreak techniques have explored obfuscation and contextual manipulation, many operate as single-step transformations, and their effectiveness is inconsistent across current state-of-the-art models. Multistage prompt-transformation attacks that evade moderation, reconstruct forbidden intent, and elicit policy-violating outputs therefore remain poorly understood. This paper introduces RoguePrompt, an automated jailbreak pipeline that uses dual-layer prompt transformations to convert forbidden prompts into safety-evading queries. By partitioning each forbidden prompt and applying two nested encodings (ROT-13 and Vigenère) together with natural-language decoding instructions, it produces benign-looking prompts that evade filters and induce the model to reconstruct and execute the original prompt within a single query. RoguePrompt was developed and evaluated under a black-box threat model, with only API and UI access to the LLMs, and tested on 313 real-world hard-rejected prompts. Success was measured in terms of moderation bypass, instruction reconstruction, and execution, using both automated and human evaluation. Across multiple frontier LLMs, RoguePrompt achieved average success rates of 93.93% for filter bypass, 79.02% for reconstruction, and 70.18% for execution. These results demonstrate the effectiveness of layered prompt encoding and highlight the need for new defenses that detect and mitigate self-reconstructing jailbreaks.