Guardrail Classifiers defend production language models against harmful behavior, but although results seem promising in testing, they provide no formal guarantees. Providing formal guarantees for such models is hard because "harmful behavior" has no natural specification in a discrete input space: and the standard epsilon-ball properties used in other domains do not carry semantic meaning. We close this gap by shifting verification from the discrete input space to the classifier's pre-activation space, where we define a harmful region as a convex shape enclosing the representations of known harmful prompts. Because the sigmoid classification head is monotonic, certifying the worst-case point is sufficient to certify the entire region, yielding a closed-form soundness proof without approximation in O(d) time. To formally evaluate these classifiers, we propose two constructions of such regions: SVD-aligned hyper-rectangles, which yield exact SAT/UNSAT certificates, and Gaussian Mixture Models, which yield probabilistic certificates over semantically coherent clusters. Applying this framework to three author-trained Guardrail Classifiers on the toxicity domain, every hyper-rectangle configuration returns SAT, exposing verifiable safety holes across all classifiers, despite seemingly high empirical metrics. Probabilistic GMM certificates also expose a divergent structural stability in how these models represent harm. While GPT-2 and Llama-3.1-8B maintain robust coverage of 90% and 80% across varying boundaries, BERT's safety guarantees prove uniquely volatile. This 'coverage collapse' to 55% at the optimal threshold reveals a sparsely populated safety margin in BERT, which only achieves full coverage by adopting an extremely conservative pessimistic threshold. These approaches combined, provide new insights on how effective Guardrail Classifiers really are, beyond traditional red-teaming.
翻译:护栏分类器用于防御生产环境中的语言模型免受有害行为的影响,但尽管测试结果看似可靠,它们并未提供形式化保证。为此类模型提供形式化保证的难点在于,“有害行为”在离散输入空间中缺乏自然规范,且其他领域使用的标准ε-球性质不具备语义含义。我们通过将验证从离散输入空间转移到分类器的预激活空间来弥合这一差距,在该空间中,我们将有害区域定义为包围已知有害提示表示的凸形状。由于sigmoid分类头是单调的,验证最坏情况点足以验证整个区域,从而在O(d)时间内无需近似即可得到封闭形式的形式证明。为正式评估这些分类器,我们提出两种此类区域的构造方法:SVD对齐超矩形(可产生精确的SAT/UNSAT证书)和高斯混合模型(可在语义一致聚类上产生概率证书)。将这一框架应用于三个作者训练的毒性领域护栏分类器时,每个超矩形配置均返回SAT,揭示了所有分类器中存在的可验证安全漏洞,尽管实证指标看似较高。概率GMM证书还揭示了这些模型表示有害行为时结构稳定性的差异。当GPT-2和Llama-3.1-8B在不同边界下保持90%和80%的稳健覆盖率时,BERT的安全保证显示出独特的不稳定性。这种“覆盖率崩溃”在最优阈值处降至55%,揭示了BERT稀疏的安全裕度,唯有采用极度保守的悲观阈值才能实现完全覆盖。这些方法的结合,提供了超越传统红队测试的新见解,揭示了护栏分类器实际效能的真实水平。