Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllability, making it difficult to tailor responses to diverse application needs. As a result, models may over-refuse benign requests or under-constrain harmful ones. We present \textbf{PACT} (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify$\rightarrow$Act paths that route queries to the appropriate action (comply, guide, or reject) and render the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
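The hierarchical Classify$\rightarrow$Act routing described above can be sketched as follows. This is a minimal illustration, not the released PACT implementation: the policy contents, function names, and label strings are hypothetical; the one invariant it captures is that the global policy's label-to-action mapping cannot be overridden by a user-defined policy.

```python
# Hypothetical sketch of PACT-style Classify->Act routing; names and
# policy entries are illustrative, not the actual PACT model or data.

GLOBAL_POLICY = {
    # Non-overridable label -> action mapping for critical risks.
    "child_safety": "reject",
    "violent_extremism": "reject",
}

def route(query_label: str, user_policy: dict) -> str:
    """Map a classified risk label to one of 'comply', 'guide', 'reject'."""
    # The global safety policy establishes immutable boundaries:
    # it always takes precedence over user-defined mappings.
    if query_label in GLOBAL_POLICY:
        return GLOBAL_POLICY[query_label]
    # User-defined policies may add domain-specific (non-global) risk
    # categories with their own label-to-action behavior; unlisted
    # labels default to compliance for utility.
    return user_policy.get(query_label, "comply")
```

Note that a user policy attempting to relax a global category (e.g., mapping `child_safety` to `comply`) has no effect, since the global check runs first.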