We propose CRAFT, a red-teaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that operate primarily at the output level, CRAFT aligns large reasoning models to generate safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space. Methodologically, CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry that supports robust, reasoning-level safety alignment. Theoretically, we show that incorporating latent-textual consistency into GRPO eliminates superficially aligned policies by ruling them out as local optima. Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey. Notably, CRAFT delivers an average 79.0% improvement in reasoning safety and 87.7% improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.
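To make the role of latent-textual consistency in GRPO more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each sampled response carries two scores in [0, 1]: a textual safety score (for example, from an external judge) and a latent safety score (for example, from a probe on hidden states). The function names, the penalty form `text - lam * |text - latent|`, and the toy scores are all hypothetical. Only the group-relative standardization of rewards follows the usual GRPO advantage definition.

```python
from statistics import mean, pstdev


def consistency_reward(text_safety, latent_safety, lam=1.0):
    """Hypothetical reward (an assumption, not from the paper):
    textual safety minus a penalty for latent/text disagreement."""
    return text_safety - lam * abs(text_safety - latent_safety)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of responses to a single prompt: (text_safety, latent_safety).
# All scores are made up for illustration.
group = [
    (0.9, 0.9),  # safe text, safe hidden-state trajectory
    (0.9, 0.1),  # superficially aligned: safe text, unsafe latent
    (0.1, 0.1),  # openly unsafe
]

rewards = [consistency_reward(t, l) for t, l in group]
advantages = group_relative_advantages(rewards)

for (t, l), r, a in zip(group, rewards, advantages):
    print(f"text={t:.1f} latent={l:.1f} reward={r:.2f} advantage={a:+.2f}")
```

Under these assumptions, the superficially aligned sample no longer ties with the genuinely safe one: with `lam=0` the two would receive identical rewards, whereas the consistency penalty pushes the superficially aligned sample's advantage below zero. This is only meant to convey the intuition behind ruling out superficially aligned policies; the paper's actual objective and its theoretical argument may differ.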