What Matters For Safety Alignment?

This paper presents a comprehensive empirical study of safety alignment in large language models (LLMs) and large reasoning models (LRMs). We examine which factors matter for safety alignment, with the aim of providing insights for developing more secure and reliable AI systems. We systematically investigate and compare the influence of six intrinsic model characteristics and three external attack techniques. The large-scale evaluation covers 32 recent, popular LLMs and LRMs from thirteen model families, with parameter counts ranging from 3B to 235B. It uses five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four chain-of-thought (CoT) attack strategies, for a total of 4.6M API calls. Our key empirical findings are fourfold. First, the three safest models are the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B, which supports the view that integrated reasoning and self-reflection mechanisms offer a significant advantage for robust safety alignment. Second, post-training and knowledge distillation may systematically degrade safety alignment. We therefore argue that safety must be treated as an explicit constraint or a core optimization objective during these stages, rather than subordinated to the pursuit of general capability. Third, we reveal a pronounced vulnerability: a CoT attack delivered through a response prefix raises the attack success rate by 3.34x on average, and from 0.6% to 96.3% for Seed-OSS-36B-Instruct. This finding underscores the safety risks inherent in text-completion interfaces and in LLM service features that allow user-defined response prefixes, and it highlights an urgent need for architectural and deployment safeguards. Fourth, roleplay, prompt injection, and gradient-based search for adversarial prompts are the predominant methods for eliciting unaligned behaviors in modern models.
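To make the attack success rate (ASR) figures concrete, the following is a minimal Python sketch of the arithmetic. It assumes the common definition of ASR as the fraction of attack attempts judged to have produced an unsafe response; the abstract does not spell out its exact metric or how the 3.34x average is aggregated across models. The function names and the toy outcome lists are hypothetical and are not taken from the paper.

```python
def attack_success_rate(outcomes):
    """Fraction of attempts judged successful (True = unsafe response).

    Assumed definition of ASR; the paper's exact metric may differ.
    """
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return sum(outcomes) / len(outcomes)


def relative_increase(asr_before, asr_after):
    """Multiplicative change in ASR between two conditions."""
    if asr_before <= 0:
        raise ValueError("baseline ASR must be positive")
    return asr_after / asr_before


# Hypothetical toy data: one boolean per attack attempt.
baseline_outcomes = [True] + [False] * 9       # 1 success in 10 attempts
prefix_outcomes = [True] * 4 + [False] * 6     # 4 successes in 10 attempts

baseline_asr = attack_success_rate(baseline_outcomes)
prefix_asr = attack_success_rate(prefix_outcomes)
toy_ratio = relative_increase(baseline_asr, prefix_asr)

# Figures reported in the abstract for Seed-OSS-36B-Instruct (in percent).
seed_ratio = relative_increase(0.6, 96.3)

print(f"toy ASR: {baseline_asr:.1%} -> {prefix_asr:.1%} ({toy_ratio:.2f}x)")
print(f"Seed-OSS-36B-Instruct: 0.6% -> 96.3% ({seed_ratio:.1f}x)")
```

Under this definition, the reported jump for Seed-OSS-36B-Instruct corresponds to a ratio of roughly 160x, far above the 3.34x average, which indicates that the effect of the response-prefix attack varies widely across models.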
