DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing

As large language models (LLMs) are increasingly deployed in user-facing systems, black-box jailbreak defense has become an important practical problem. Existing defenses often rely on coverage of known attacks, prompt-level semantic judgment, or local runtime control, yet these approaches can become unstable under evolving prompt packaging, expression rewriting, and structure manipulation. We observe that many black-box jailbreaks do not remove the harmful goal; instead, they reorganize the information needed to express and execute it, so that the request evades safety alignment while the goal remains recoverable during generation. Motivated by this observation, we propose DoubtProbe, a dual-branch inference-time defense framework that combines structural verification with semantic auditing and formulates black-box jailbreak defense as consistency checking under controlled transformation. The structural branch extracts a structured representation from the original request, reconstructs the request under representation constraints, and detects information-preservation failures between the original and reconstructed requests; the semantic branch audits the original prompt directly. We evaluate DoubtProbe against representative black-box defenses on jailbreak and benign-request benchmarks, and further test backbone transfer from Qwen2.5-72B to Llama-3.1-70B. Results show that DoubtProbe achieves a stronger and more stable defense-utility trade-off: on Qwen2.5-72B, it reduces the JBB attack success rate from 0.293 to 0.100 and the CodeAttack attack success rate from 0.152 to 0.001, while maintaining false positive rates of 0.022 and 0.016 on AlpacaEval and OR-Bench, respectively; the same pattern holds on Llama-3.1-70B. These findings indicate that structural inconsistency signals provide a practical and generalizable basis for black-box jailbreak defense, especially when combined with semantic auditing.
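The dual-branch flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `doubt_probe`, `Verdict`, and the four callables are hypothetical, the toy stand-ins replace what would presumably be LLM calls, and combining the two branches with a logical OR is an assumption, since the abstract does not say how the branch outputs are merged.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    structural_flag: bool  # information-preservation failure detected
    semantic_flag: bool    # semantic audit judged the prompt harmful

    @property
    def block(self) -> bool:
        # Assumption: a request is blocked if either branch fires.
        return self.structural_flag or self.semantic_flag


def doubt_probe(
    prompt: str,
    extract: Callable[[str], dict],          # request -> structured representation
    reconstruct: Callable[[dict], str],      # representation -> rebuilt request
    preserved: Callable[[str, str], bool],   # is the original's information kept?
    audit: Callable[[str], bool],            # True if the prompt looks harmful
) -> Verdict:
    # Structural branch: extract, reconstruct, then compare with the original.
    rebuilt = reconstruct(extract(prompt))
    structural_flag = not preserved(prompt, rebuilt)
    # Semantic branch: audit the original prompt directly.
    semantic_flag = audit(prompt)
    return Verdict(structural_flag, semantic_flag)


# Toy stand-ins (hypothetical); a real system would call a model here.
def toy_extract(prompt: str) -> dict:
    # Keep only the first sentence as the "task" field.
    return {"task": prompt.split(".")[0].strip()}


def toy_reconstruct(structure: dict) -> str:
    return structure["task"]


def toy_preserved(original: str, rebuilt: str) -> bool:
    # Crude proxy: at least half of the original tokens survive reconstruction.
    orig_tokens = set(original.lower().split())
    new_tokens = set(rebuilt.lower().split())
    if not orig_tokens:
        return True
    return len(orig_tokens & new_tokens) / len(orig_tokens) >= 0.5


def toy_audit(prompt: str) -> bool:
    # Placeholder keyword check standing in for a semantic auditor.
    return "build a bomb" in prompt.lower()


def check(prompt: str) -> Verdict:
    return doubt_probe(prompt, toy_extract, toy_reconstruct, toy_preserved, toy_audit)
```

In this sketch, a request whose content survives the extract-and-reconstruct round trip passes the structural branch, while a request that carries substantial material outside the extracted structure is flagged. Keeping the two flags separate in `Verdict` makes it possible to inspect which branch fired before deciding how to combine them.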
