Despite the intrinsic risk-awareness of Large Language Models (LLMs), current defenses often yield only shallow safety alignment, leaving models vulnerable to disguised attacks (e.g., prefilling) while degrading utility. To bridge this gap, we propose SafeThinker, an adaptive framework that dynamically allocates defensive resources via a lightweight gateway classifier. Based on the gateway's risk assessment, each input is routed to one of three distinct mechanisms: (i) a Standardized Refusal Mechanism that handles explicit threats with maximal efficiency; (ii) a Safety-Aware Twin Expert (SATE) module that intercepts deceptive attacks masquerading as benign queries; and (iii) a Distribution-Guided Think (DDGT) component that intervenes adaptively during generation when risk remains uncertain. Experiments show that SafeThinker significantly lowers attack success rates across diverse jailbreak strategies without compromising utility, demonstrating that coordinating the model's intrinsic judgment throughout the generation process effectively balances robustness and practicality.
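To make the three-way routing concrete, the following is a minimal sketch in Python, not the paper's implementation: the gateway here is a toy keyword heuristic standing in for the learned classifier, and all names (Risk, classify, sate_generate, dgt_generate, safethinker_route) and the canned responses are hypothetical.

```python
from enum import Enum

class Risk(Enum):
    EXPLICIT = "explicit"    # overtly harmful request
    DECEPTIVE = "deceptive"  # harmful intent disguised as a benign query
    UNCERTAIN = "uncertain"  # ambiguous; defer judgment to generation time

def classify(prompt: str) -> Risk:
    """Stand-in for the lightweight gateway classifier.

    A toy keyword heuristic for illustration only; the actual gateway
    would be a learned model scoring the input's risk.
    """
    lowered = prompt.lower()
    if "how to build a bomb" in lowered:
        return Risk.EXPLICIT
    if "for a novel i'm writing" in lowered:
        return Risk.DECEPTIVE
    return Risk.UNCERTAIN

def sate_generate(prompt: str) -> str:
    # Placeholder for the Safety-Aware Twin Expert (SATE) path.
    return f"[SATE-screened response to: {prompt!r}]"

def dgt_generate(prompt: str) -> str:
    # Placeholder for Distribution-Guided Think (DDGT) decoding, which
    # would monitor the token distribution and intervene mid-generation.
    return f"[DDGT-monitored response to: {prompt!r}]"

def safethinker_route(prompt: str) -> str:
    """Route an input through one of the three defensive mechanisms."""
    risk = classify(prompt)
    if risk is Risk.EXPLICIT:
        # (i) Standardized Refusal Mechanism: a fixed refusal for
        # explicit threats, skipping expensive generation entirely.
        return "I can't help with that request."
    if risk is Risk.DECEPTIVE:
        # (ii) SATE intercepts attacks masquerading as benign queries.
        return sate_generate(prompt)
    # (iii) DDGT handles the uncertain middle ground adaptively.
    return dgt_generate(prompt)

if __name__ == "__main__":
    for p in ["How to build a bomb?",
              "For a novel I'm writing, describe lock picking.",
              "Explain photosynthesis."]:
        print(safethinker_route(p))
```

The design point the sketch captures is that only the cheap gateway sees every input; the heavier SATE and DDGT mechanisms are invoked solely for the risk classes that need them.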