Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, suffer from poor precision, limited flexibility, and a lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs): fine-grained, interpretable factors such as "making a threat" and "payment processing" that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We will release GAVEL as an open-source framework, together with an accompanying automated rule-creation tool.
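As a rough illustration of the compositional idea, the sketch below evaluates a conjunctive predicate rule over per-CE activation scores. The CE names, the rule format, and the threshold are illustrative assumptions for exposition, not the actual GAVEL interface.

```python
# Hypothetical sketch (not the GAVEL API): a predicate rule fires only when
# every cognitive element (CE) it names is active, so individually benign
# CEs like "payment_processing" do not trigger alarms on their own.

def violates(rule, ce_scores, threshold=0.5):
    """Return True when all CEs required by the rule exceed the threshold."""
    return all(ce_scores.get(ce, 0.0) >= threshold for ce in rule["all_of"])

# Illustrative rule: extortion = a threat combined with payment processing.
rule = {"name": "extortion", "all_of": ["making_a_threat", "payment_processing"]}

# Both CEs active -> the composed rule fires.
assert violates(rule, {"making_a_threat": 0.91, "payment_processing": 0.78})

# Payment talk alone -> no violation, illustrating the precision gain.
assert not violates(rule, {"making_a_threat": 0.12, "payment_processing": 0.78})
```

Because rules are plain data over named CEs, practitioners could add, audit, or retire them without retraining any model or detector.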