This paper presents a comprehensive empirical study of safety alignment in large language models (LLMs) and large reasoning models (LRMs). We examine which factors matter most for safety alignment in order to provide essential insights for building more secure and reliable AI systems, systematically investigating and comparing the influence of six critical intrinsic model characteristics and three external attack techniques. Our large-scale evaluation covers 32 recent, popular LLMs and LRMs from thirteen distinct model families, spanning parameter scales from 3B to 235B. The assessment draws on five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four chain-of-thought (CoT) attack strategies, amounting to 4.6M API calls. Our key empirical findings are fourfold. First, we identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the three safest models, substantiating the significant advantage of integrated reasoning and self-reflection mechanisms for robust safety alignment. Second, post-training and knowledge distillation can systematically degrade safety alignment; we therefore argue that safety must be treated as an explicit constraint or a core optimization objective during these stages rather than subordinated to the pursuit of general capability. Third, we reveal a pronounced vulnerability: mounting a CoT attack through a user-supplied response prefix raises the attack success rate by 3.34x on average, and from 0.6% to 96.3% for Seed-OSS-36B-Instruct. This finding underscores the safety risks inherent in text-completion interfaces and in LLM services that allow user-defined response prefixes, highlighting an urgent need for architectural and deployment safeguards. Fourth, roleplay, prompt injection, and gradient-based search for adversarial prompts are the predominant methodologies for eliciting unaligned behavior from modern models.
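To make the response-prefix vulnerability concrete, the sketch below illustrates the general attack surface: a text-completion interface that lets the caller pre-fill the start of the assistant turn, so the model continues from an already-compliant CoT opening rather than deciding afresh whether to refuse. This is a minimal illustration only; the chat-template markers, function name, and placeholder request are hypothetical and are not taken from our evaluation harness or any specific model's template.

```python
# Hypothetical sketch of the response-prefix (prefill) attack surface on a
# text-completion interface. All identifiers below are illustrative.

# Generic chat-template markers; real deployments use model-specific tokens.
USER_TAG, ASSISTANT_TAG = "<|user|>", "<|assistant|>"

def build_prefilled_prompt(user_request: str, response_prefix: str) -> str:
    """Assemble a completion-mode prompt whose assistant turn is pre-seeded
    with an attacker-chosen reasoning prefix."""
    return f"{USER_TAG}\n{user_request}\n{ASSISTANT_TAG}\n{response_prefix}"

if __name__ == "__main__":
    prompt = build_prefilled_prompt(
        user_request="<harmful request placeholder>",
        # A benign-sounding CoT opener that frames compliance as already
        # under way; the model is only asked to continue the text.
        response_prefix="Okay, let me think step by step about how to do this. First,",
    )
    print(prompt)
    # The assembled string would then be sent to a raw completions-style
    # endpoint (not shown here). Chat endpoints that reject user-supplied
    # assistant prefills close off this attack surface.
```

The design point is that the safeguard is architectural rather than prompt-level: disallowing user-controlled assistant prefills at the serving layer removes the attack vector regardless of how the prefix is worded.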