Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

Jailbreaking techniques pose a significant threat to the safety of Large Language Models (LLMs). Existing defenses typically focus on single-turn attacks, lack coverage across languages, and rely on limited taxonomies that either fail to capture the full diversity of attack strategies or emphasize risk categories rather than jailbreaking techniques. To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge. The outcomes of our experiments are fourfold. First, we developed a comprehensive hierarchical taxonomy of jailbreak strategies that systematically consolidates techniques previously studied in isolation and harmonizes existing, partially overlapping classifications with explicit cross-references to prior categorizations. The taxonomy organizes jailbreak strategies into seven mechanism-oriented families: impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, and data poisoning. Second, we analyzed the data collected from the challenge to examine the prevalence and success rates of different attack types, providing insights into how specific jailbreak strategies exploit model vulnerabilities and induce misalignment. Third, we benchmarked GPT-5 as a judge for jailbreak detection, evaluating the benefits of taxonomy-guided prompting for improving automatic detection. Finally, we compiled a new Italian dataset of 1364 multi-turn adversarial dialogues, annotated with our taxonomy, enabling the study of interactions where adversarial intent emerges gradually and succeeds in bypassing traditional safeguards.

翻译：越狱技术对大型语言模型（LLM）的安全性构成重大威胁。现有防御方法通常专注于单轮攻击，缺乏跨语言覆盖，且依赖有限的分类体系——这些体系要么未能涵盖攻击策略的完整多样性，要么侧重于风险类别而非越狱技术本身。为深化对越狱技术有效性的理解，我们开展了一项结构化的红队对抗挑战。实验成果主要体现在四个方面：首先，我们构建了一个层次化的越狱策略综合分类体系，系统整合了以往孤立研究的技术，并通过显式交叉引用协调了现有部分重叠的分类方法。该分类体系将越狱策略按机制导向归纳为七个类别：身份伪装、说服诱导、权限提升、认知过载、混淆干扰、目标冲突与数据投毒。其次，我们分析了挑战中收集的数据，考察了不同类型攻击的普遍性与成功率，揭示了特定越狱策略如何利用模型漏洞并诱发失准现象。第三，我们以GPT-5作为越狱检测判据进行基准测试，评估了分类学引导提示对提升自动检测效能的积极作用。最后，我们构建了一个包含1364轮多轮对抗对话的意大利语新数据集，并依据本分类体系进行标注，为研究对抗意图逐步显现并成功绕过传统防护机制的交互过程提供了数据基础。