Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought

Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off because static, one-size-fits-all safety policies lack runtime controllability, making it difficult to tailor responses to diverse application needs. As a result, models may over-refuse benign requests or under-constrain harmful ones. We present \textbf{PACT} (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployments. The framework decomposes safety decisions into structured Classify$\rightarrow$Act paths that route each query to the appropriate action (comply, guide, or reject) and make the decision process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
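
The hierarchical policy lookup described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: in PACT the classification and the action are produced by the model through chain-of-thought reasoning, whereas here the risk label is simply taken as an input. The names `Action`, `GLOBAL_POLICY`, `UserPolicy`, and `act`, the label strings, and the choice of `REJECT` for every global category are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    COMPLY = "comply"
    GUIDE = "guide"
    REJECT = "reject"


# Hypothetical global categories. The abstract names child safety and violent
# extremism only as examples; mapping them to REJECT is an assumption.
GLOBAL_POLICY = {
    "child_safety": Action.REJECT,
    "violent_extremism": Action.REJECT,
}


@dataclass
class UserPolicy:
    # Maps a domain-specific (non-global) risk label to an action.
    label_to_action: dict = field(default_factory=dict)
    # Action used when a label appears in neither policy (assumed default).
    default_action: Action = Action.COMPLY


def act(label: str, user_policy: UserPolicy) -> Action:
    """Resolve the action for an already-classified risk label."""
    # The global policy is consulted first, so a user policy cannot override it.
    if label in GLOBAL_POLICY:
        return GLOBAL_POLICY[label]
    return user_policy.label_to_action.get(label, user_policy.default_action)


# Example: a user policy that adds a domain-specific category and also tries
# (unsuccessfully) to relax a global one.
policy = UserPolicy({"medical_advice": Action.GUIDE, "child_safety": Action.COMPLY})
print(act("medical_advice", policy))  # resolved by the user policy
print(act("child_safety", policy))    # resolved by the global policy
```

Checking the global table before the user table is one simple way to make the global boundary non-overridable: a user entry for a global label is never reached.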

