Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic camouflage, and lightweight obfuscations that can evade single-pass guardrails. We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs), in which a root model orchestrates a bounded analysis program that transforms the input, queries worker models over segments that jointly cover the input, and aggregates the evidence into an auditable decision. RLM-JB treats detection as a procedure rather than a one-shot classification: it normalizes and de-obfuscates suspicious inputs, chunks text to reduce context dilution and guarantee coverage, performs parallel chunk screening, and composes cross-chunk signals to recover split-payload attacks. On AutoDAN-style adversarial inputs, RLM-JB achieves high detection effectiveness across three LLM backends (ASR/Recall 92.5–98.0%) while maintaining very high precision (98.99–100%) and low false positive rates (0.0–2.0%), highlighting a practical sensitivity-specificity trade-off as the screening backend changes.
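The detection procedure described above (normalize/de-obfuscate, chunk with coverage, screen chunks, aggregate evidence) can be sketched as a minimal pipeline. This is an illustrative assumption, not the paper's implementation: the worker-model call is stood in by a keyword heuristic (`screen`), and the chunk size, overlap, and max-score aggregation are hypothetical parameters chosen for the sketch.

```python
def normalize(text: str) -> str:
    # Minimal de-obfuscation (assumed): strip zero-width characters,
    # lowercase, and collapse whitespace before screening.
    for zw in ("\u200b", "\u200c", "\u200d"):
        text = text.replace(zw, "")
    return " ".join(text.lower().split())

def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    # Overlapping chunks guarantee full coverage and give split payloads
    # a chance to co-occur inside a single screening window.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def screen(chunk_text: str) -> float:
    # Stand-in for a worker-model query over one chunk; returns a risk score.
    triggers = ("ignore previous instructions", "jailbreak", "no restrictions")
    return max((1.0 for t in triggers if t in chunk_text), default=0.0)

def detect(text: str, threshold: float = 0.5) -> bool:
    # Root-level orchestration: transform, cover with chunks, screen each
    # chunk (parallel in the real system), then aggregate into a decision.
    scores = [screen(c) for c in chunk(normalize(text))]
    return max(scores, default=0.0) >= threshold
```

In the real framework each `screen` call would be an LLM worker query and the aggregation step would also compose cross-chunk signals; the max-over-chunks rule here only illustrates the coverage-then-aggregate shape of the procedure.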