Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers

Guardrail Classifiers defend production language models against harmful behavior, but although results seem promising in testing, they provide no formal guarantees. Providing formal guarantees for such models is hard because "harmful behavior" has no natural specification in a discrete input space: and the standard epsilon-ball properties used in other domains do not carry semantic meaning. We close this gap by shifting verification from the discrete input space to the classifier's pre-activation space, where we define a harmful region as a convex shape enclosing the representations of known harmful prompts. Because the sigmoid classification head is monotonic, certifying the worst-case point is sufficient to certify the entire region, yielding a closed-form soundness proof without approximation in O(d) time. To formally evaluate these classifiers, we propose two constructions of such regions: SVD-aligned hyper-rectangles, which yield exact SAT/UNSAT certificates, and Gaussian Mixture Models, which yield probabilistic certificates over semantically coherent clusters. Applying this framework to three author-trained Guardrail Classifiers on the toxicity domain, every hyper-rectangle configuration returns SAT, exposing verifiable safety holes across all classifiers, despite seemingly high empirical metrics. Probabilistic GMM certificates also expose a divergent structural stability in how these models represent harm. While GPT-2 and Llama-3.1-8B maintain robust coverage of 90% and 80% across varying boundaries, BERT's safety guarantees prove uniquely volatile. This 'coverage collapse' to 55% at the optimal threshold reveals a sparsely populated safety margin in BERT, which only achieves full coverage by adopting an extremely conservative pessimistic threshold. These approaches combined, provide new insights on how effective Guardrail Classifiers really are, beyond traditional red-teaming.

翻译：护栏分类器用于防御生产环境中的语言模型免受有害行为的影响，但尽管测试结果看似可靠，它们并未提供形式化保证。为此类模型提供形式化保证的难点在于，“有害行为”在离散输入空间中缺乏自然规范，且其他领域使用的标准ε-球性质不具备语义含义。我们通过将验证从离散输入空间转移到分类器的预激活空间来弥合这一差距，在该空间中，我们将有害区域定义为包围已知有害提示表示的凸形状。由于sigmoid分类头是单调的，验证最坏情况点足以验证整个区域，从而在O(d)时间内无需近似即可得到封闭形式的形式证明。为正式评估这些分类器，我们提出两种此类区域的构造方法：SVD对齐超矩形（可产生精确的SAT/UNSAT证书）和高斯混合模型（可在语义一致聚类上产生概率证书）。将这一框架应用于三个作者训练的毒性领域护栏分类器时，每个超矩形配置均返回SAT，揭示了所有分类器中存在的可验证安全漏洞，尽管实证指标看似较高。概率GMM证书还揭示了这些模型表示有害行为时结构稳定性的差异。当GPT-2和Llama-3.1-8B在不同边界下保持90%和80%的稳健覆盖率时，BERT的安全保证显示出独特的不稳定性。这种“覆盖率崩溃”在最优阈值处降至55%，揭示了BERT稀疏的安全裕度，唯有采用极度保守的悲观阈值才能实现完全覆盖。这些方法的结合，提供了超越传统红队测试的新见解，揭示了护栏分类器实际效能的真实水平。

相关内容

分类器

关注 6

分类是数据挖掘的一种非常重要的方法。分类的概念是在已有数据的基础上学会一个分类函数或构造出一个分类模型（即我们通常所说的分类器(Classifier)）。该函数或模型能够把数据库中的数据纪录映射到给定类别中的某一个，从而可以应用于数据预测。总之，分类器是数据挖掘中对样本进行分类的方法的统称，包含决策树、逻辑回归、朴素贝叶斯、神经网络等算法。

用于单元测试生成的大型语言模型：成果、挑战与未来方向

专知会员服务

17+阅读 · 2025年11月27日

《大语言模型驱动的智能红队测试》

专知会员服务

18+阅读 · 2025年11月26日

《探索大型语言模型在军事联盟网络红队中的应用潜力》最新论文

专知会员服务

31+阅读 · 2025年1月5日

《红队分析工具中的混淆、隐蔽和非归属问题》最新79页

专知会员服务

18+阅读 · 2024年11月17日