Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, suffer from poor precision, limited flexibility, and a lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs): fine-grained, interpretable factors, such as "making a threat" and "payment processing", that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We will release GAVEL as an open-source framework, together with an accompanying automated rule creation tool.
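As a rough illustration only (not the paper's actual implementation, whose API is not described here), a predicate rule composed over hypothetical CE activation scores might look like the following sketch; the CE names and thresholds are invented for the example:

```python
# Hypothetical sketch: a conjunctive predicate rule over cognitive-element
# (CE) activation scores. CE names and thresholds are illustrative only.

from typing import Callable, Dict

Rule = Callable[[Dict[str, float]], bool]

def make_conjunction_rule(ce_thresholds: Dict[str, float]) -> Rule:
    """Build a rule that fires when every listed CE score exceeds its threshold."""
    def rule(ce_scores: Dict[str, float]) -> bool:
        # A missing CE defaults to 0.0, i.e. "not active".
        return all(ce_scores.get(ce, 0.0) > t for ce, t in ce_thresholds.items())
    return rule

# Example rule: flag outputs that combine a threat with payment processing,
# a composition suggestive of extortion.
extortion_rule = make_conjunction_rule(
    {"making_a_threat": 0.8, "payment_processing": 0.8}
)

print(extortion_rule({"making_a_threat": 0.9, "payment_processing": 0.85}))  # True
print(extortion_rule({"making_a_threat": 0.9}))                              # False
```

Because the rule is a plain predicate over named, interpretable factors, it can be updated or audited without retraining any model, which is the flexibility the abstract emphasizes.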