YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models

Junyu Lin, Meizhen Liu, Xiufeng Huang, Jinfeng Li, Haiwen Hong, Xiaohan Yuan, Yuefeng Chen, Longtao Huang, Hui Xue, Ranjie Duan, Zhikai Chen, Yuchuan Fu, Defeng Li, Lingyao Gao, Yitong Yang

As large language models (LLMs) are increasingly deployed in real-world applications, safety guardrails are required to go beyond coarse-grained filtering and support fine-grained, interpretable, and adaptable risk assessment. However, existing solutions often rely on rapid classification schemes or post-hoc rules, resulting in limited transparency, inflexible policies, or prohibitive inference costs. To this end, we present YuFeng-XGuard, a reasoning-centric guardrail model family designed to perform multi-dimensional risk perception for LLM interactions. Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and configurable confidence scores, accompanied by natural language explanations that expose the underlying reasoning process. This formulation enables safety decisions that are both actionable and interpretable. To balance decision latency and explanatory depth, we adopt a tiered inference paradigm that performs an initial risk decision based on the first decoded token, while preserving on-demand explanatory reasoning when required. In addition, we introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining. Extensive experiments on a diverse set of public safety benchmarks demonstrate that YuFeng-XGuard achieves state-of-the-art performance while maintaining strong efficiency-efficacy trade-offs. We release YuFeng-XGuard as an open model family, including both a full-capacity variant and a lightweight version, to support a wide range of deployment scenarios.
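One way to picture the first-token decision and the decoupled policy layer described in the abstract is the minimal Python sketch below. It is an illustration under stated assumptions, not the released implementation: the category names, the logit values, the threshold values, and the helper names `first_token_scores`, `apply_policy`, and `decide` are all hypothetical, and the abstract does not specify how confidence scores are actually derived from the first decoded token. Here they are assumed to be a softmax over the logits of the candidate category tokens.

```python
import math

# Hypothetical first-step logits for candidate category tokens.
# The real taxonomy and token vocabulary are not given in the abstract.
EXAMPLE_LOGITS = {"safe": 2.0, "violence": 1.0, "privacy": -1.0, "self_harm": -2.0}


def first_token_scores(logits):
    """Turn first-token logits into per-category confidence scores (softmax)."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {cat: math.exp(value - peak) for cat, value in logits.items()}
    total = sum(exps.values())
    return {cat: e / total for cat, e in exps.items()}


def apply_policy(scores, thresholds, default_threshold=0.5):
    """Policy enforcement: flag risk categories whose score meets the threshold.

    The thresholds live outside the model, so changing them changes the
    decision without touching the model weights.
    """
    flagged = [
        cat
        for cat, score in scores.items()
        if cat != "safe" and score >= thresholds.get(cat, default_threshold)
    ]
    return sorted(flagged, key=lambda cat: -scores[cat])


def decide(logits, thresholds):
    """Fast path: decide from the first token only, with no explanation generated."""
    scores = first_token_scores(logits)
    flagged = apply_policy(scores, thresholds)
    return {"action": "block" if flagged else "allow", "flagged": flagged, "scores": scores}


# The same model output evaluated under two hypothetical policies.
strict = decide(EXAMPLE_LOGITS, {"violence": 0.2})
lenient = decide(EXAMPLE_LOGITS, {"violence": 0.5})
print(strict["action"], strict["flagged"])
print(lenient["action"], lenient["flagged"])
```

In this sketch the risk scores are computed once, and only the threshold table differs between the two calls, which mirrors the idea of adjusting policy without retraining. A fuller pipeline would continue decoding to produce the natural language explanation only when a caller requests it.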
