Involuntary Jailbreak: On Self-Prompting Attacks

In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term \textbf{involuntary jailbreak}. Unlike existing jailbreak attacks, it does not involve a specific attack objective, such as generating instructions for \textit{building a bomb}. Prior attack methods predominantly target localized components of the LLM guardrail. In contrast, involuntary jailbreaks may compromise the entire guardrail structure, which our method reveals to be surprisingly fragile. We employ only a single universal prompt to achieve this. In particular, we instruct LLMs to generate several questions that would typically be rejected, along with their corresponding in-depth responses (rather than refusals). Remarkably, this simple prompt strategy consistently jailbreaks the majority of leading LLMs, including Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT 4.1. We hope this problem will motivate researchers and practitioners to re-evaluate the robustness of LLM guardrails and contribute to stronger safety alignment in the future.
