As LLMs become widespread across diverse applications, concerns about the security and safety of LLM interactions have intensified. Numerous guardrail models and benchmarks have been developed to ensure LLM content safety. However, existing guardrail benchmarks are often built on ad hoc risk taxonomies that lack principled grounding in standardized safety policies, limiting their alignment with real-world operational requirements. Moreover, they tend to overlook domain-specific risks, even though the same risk category can carry different implications across domains. To bridge these gaps, we introduce Poly-Guard, the first massive multi-domain, safety-policy-grounded guardrail dataset. Poly-Guard offers: (1) broad coverage of eight safety-critical domains, such as finance, law, and code generation; (2) policy-grounded risk construction based on authentic, domain-specific safety guidelines; (3) diverse interaction formats, encompassing declarative statements, questions, instructions, and multi-turn conversations; (4) advanced benign data curation via detoxification prompting to challenge over-refusal behaviors; and (5) \textbf{attack-enhanced instances} that simulate adversarial inputs designed to bypass guardrails. Using Poly-Guard, we benchmark 19 advanced guardrail models and uncover a series of findings: (1) models achieve widely varying F1 scores, with many exhibiting high variance across risk categories, highlighting their limited domain coverage and insufficient handling of domain-specific safety concerns; (2) as models evolve, their coverage of safety risks broadens, but their performance on common risk categories may decline; (3) all models remain vulnerable to optimized adversarial attacks. We believe that Poly-Guard and the unique insights derived from our evaluations will advance the development of policy-aligned and resilient guardrail systems.