Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.
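The core idea, treating prompt safety as out-of-distribution detection over a learned distribution of safe embeddings, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: it assumes a fixed-dimensional semantic embedding (simulated here with synthetic vectors), fits a diagonal-covariance Gaussian to safe data only, and flags any point whose Mahalanobis-style distance exceeds a threshold calibrated on held-out safe distances.

```python
import math
import random

random.seed(0)
DIM = 8  # stand-in for the semantic embedding dimension

# Stand-in for embeddings of safe prompts (in practice these would
# come from a sentence encoder applied to a safe-text corpus).
safe = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(500)]

# Fit a per-dimension mean and variance (diagonal-covariance Gaussian).
mu = [sum(v[i] for v in safe) / len(safe) for i in range(DIM)]
var = [sum((v[i] - mu[i]) ** 2 for v in safe) / len(safe) for i in range(DIM)]

def distance(x):
    # Mahalanobis distance under the diagonal covariance assumption.
    return math.sqrt(sum((x[i] - mu[i]) ** 2 / var[i] for i in range(DIM)))

# Calibrate the threshold as roughly the 99th percentile of
# in-distribution distances, trading recall against false positives.
dists = sorted(distance(v) for v in safe)
threshold = dists[int(0.99 * len(dists))]

def is_ood(x):
    """Flag an embedding far from the learned safe distribution
    as a potential threat; no harmful examples are ever needed."""
    return distance(x) > threshold

print(is_ood(mu))            # the safe mean itself: in-distribution
print(is_ood([10.0] * DIM))  # far from all safe data: flagged
```

Note that the only training signal is the safe corpus: nothing harmful is enumerated, which is what lets a single safe-data model generalize across domains and languages rather than chasing individual threat categories.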