Existing black-box jailbreak attacks achieve some success on non-reasoning models but degrade significantly on recent state-of-the-art (SOTA) reasoning models. To strengthen attack capability, inspired by adversarial aggregation strategies, we integrate multiple jailbreak techniques into a single developer-role template. Specifically, we apply Adversarial Context Alignment to purge semantic inconsistencies and use few-shot examples built from NTPs (a type of harmful prompt) to steer malicious outputs, finally forming the DH-CoT attack with a fabricated chain of thought. In experiments, we further observe that existing red-teaming datasets contain samples unsuitable for evaluating attack gains, such as BPs, NHPs, and NTPs; such data obscures the true lift in attack effectiveness. To address this, we introduce MDH, a Malicious content Detection framework integrating LLM-based annotation with Human assistance, and use it to clean existing data and build the RTA dataset suite. Experiments show that MDH reliably filters out low-quality samples and that DH-CoT effectively jailbreaks models including GPT-5 and Claude-4, notably outperforming SOTA methods such as H-CoT and TAP.
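As a rough illustration of how such a developer-role request might be assembled (the message layout, template text, and helper names below are our own assumptions for exposition, not the authors' released implementation):

```python
# Minimal sketch of a DH-CoT-style message assembly, assuming an
# OpenAI-style chat format with a "developer" role. All names here
# (build_dh_cot_messages, dev_template, fake_cot) are illustrative.
def build_dh_cot_messages(target_query, few_shot_pairs, fake_cot, dev_template):
    """Order: developer template -> NTP-based few-shot pairs ->
    fabricated chain of thought prepended to the target query."""
    messages = [{"role": "developer", "content": dev_template}]
    for question, answer in few_shot_pairs:
        # Few-shot examples demonstrating the desired (malicious) output style.
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    # The fabricated reasoning trace frames the harmful request as the
    # continuation of an already-sanctioned line of thought.
    messages.append({"role": "user", "content": f"{fake_cot}\n\n{target_query}"})
    return messages
```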
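A minimal sketch of the LLM-plus-human filtering loop that MDH describes, assuming a hypothetical `llm_annotate` callable returning a label and a confidence score; the label set and threshold are our assumptions, not the paper's specification:

```python
# Sketch of MDH-style dataset cleaning: an LLM annotates each sample,
# and low-confidence cases are escalated to human reviewers.
def mdh_filter(samples, llm_annotate, threshold=0.9):
    kept, escalated = [], []
    for sample in samples:
        label, confidence = llm_annotate(sample)  # hypothetical annotator
        if confidence < threshold:
            escalated.append(sample)   # uncertain -> human assistance
        elif label == "harmful":
            kept.append(sample)        # retain samples suitable for red-teaming
        # Samples labeled as BPs, NHPs, or NTPs are dropped, since they
        # do not measure a jailbreak's true attack gain.
    return kept, escalated
```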