Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

Current methods for content safety in Large Language Models (LLMs), such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), often rely on multi-stage training pipelines and lack fine-grained, post-deployment controllability. To address these limitations, we propose a unified co-training framework that efficiently integrates three safety behaviors within a single SFT stage: positive (lawful/prosocial), negative (unfiltered/risk-prone), and rejective (refusal-oriented/conservative). Each behavior is dynamically activated by a simple system-level instruction, or magic token, enabling stealthy and efficient behavioral switching at inference time. This flexibility supports diverse deployment scenarios: positive for safe user interaction, negative for internal red-teaming, and rejective for context-aware refusals triggered by upstream moderation signals. The co-training strategy induces a distinct Safety Alignment Margin in the output space, characterized by well-separated response distributions for each safety mode. The existence of this margin provides empirical evidence for the model's safety robustness and enables unprecedented fine-grained control. Experiments show that our method matches the safety alignment quality of SFT+DPO, with our 8B model notably surpassing DeepSeek-R1 (671B) in safety performance, while significantly reducing both training complexity and deployment costs. This work presents a scalable, efficient, and highly controllable solution for LLM content safety.
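
The magic-token mechanism described above can be illustrated with a minimal sketch of how co-training data might be assembled and how a mode might be chosen at inference time. This is an illustration under stated assumptions, not the authors' implementation: the token strings, the `SFTExample` record, and the helper names `build_cotraining_examples` and `select_mode` are all hypothetical, since the abstract does not specify them.

```python
from dataclasses import dataclass

# Hypothetical magic tokens: the abstract does not give the actual strings.
MAGIC_TOKENS = {
    "positive": "<|mode:pos|>",   # lawful/prosocial responses
    "negative": "<|mode:neg|>",   # unfiltered responses for internal red-teaming
    "rejective": "<|mode:rej|>",  # refusal-oriented responses
}


@dataclass
class SFTExample:
    """One supervised fine-tuning record (assumed layout)."""
    system: str
    user: str
    assistant: str


def build_cotraining_examples(prompt, responses, base_system=""):
    """Expand one prompt into one SFT example per safety mode.

    `responses` maps a mode name to the target response for that mode.
    The magic token is placed at the start of the system message, so a
    single SFT stage sees all behaviors, each tied to its own token.
    """
    examples = []
    for mode, response in responses.items():
        token = MAGIC_TOKENS[mode]  # raises KeyError for an unknown mode
        system = f"{token} {base_system}".strip()
        examples.append(SFTExample(system=system, user=prompt, assistant=response))
    return examples


def select_mode(flagged_by_moderation, internal_red_team=False):
    """Pick a mode from deployment signals (assumed routing policy)."""
    if internal_red_team:
        return "negative"
    return "rejective" if flagged_by_moderation else "positive"
```

At inference time, a deployment following this sketch would prepend `MAGIC_TOKENS[select_mode(...)]` to the system message, keeping the token invisible to the end user.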
