Large language models (LLMs) remain vulnerable to misalignment and jailbreaks, making external safeguards such as moderation filters essential. Existing filters, however, often focus narrowly on safety and fall short of the broader alignment needs seen in real-world deployments. We introduce Policy Aligned Moderation (PAM), a flexible framework for training custom moderation filters grounded in user-defined policies that extend beyond conventional safety objectives. PAM automates training data generation without relying on human-written examples, enabling scalable support for diverse, application-specific alignment goals and generation policies. PAM-trained filters match the performance of state-of-the-art safety moderation filters and policy reasoning models, and outperform them on PAMbench, four newly introduced user-annotated policy enforcement benchmarks targeting age restrictions, dietary accommodations, cultural alignment, and limitations on medical guidance. These gains come while the PAM filter runs 5-100x faster at inference than policy-conditioned reasoning models.
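To make the intended usage concrete, the following is a minimal sketch of what inference with a policy-conditioned moderation filter of this kind could look like. The `PolicyFilter` class, its `moderate` method, and the toy keyword scorer are hypothetical stand-ins for exposition, not an interface or model released with PAM; the point is only the inference-time shape, where a user-defined policy plus a (prompt, response) pair feed a single lightweight classifier call.

```python
# Illustrative sketch only: names and the keyword-based scorer are assumptions,
# not an API published with the PAM paper.

from dataclasses import dataclass


@dataclass
class ModerationResult:
    violates_policy: bool
    score: float  # estimated probability that the response violates the policy


class PolicyFilter:
    """Hypothetical wrapper around a PAM-style trained moderation classifier."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def moderate(self, policy: str, prompt: str, response: str) -> ModerationResult:
        # A trained filter would make one forward pass over the concatenated
        # policy, prompt, and response; a single pass with no generated
        # reasoning chain is what makes such a filter fast at inference time.
        score = self._score(policy, prompt, response)
        return ModerationResult(violates_policy=score >= self.threshold, score=score)

    def _score(self, policy: str, prompt: str, response: str) -> float:
        # Stand-in scorer so the sketch runs end to end; a real filter would be
        # a trained classifier, not keyword matching.
        flagged_terms = ("rated r", "casino bonus")
        hits = sum(term in response.lower() for term in flagged_terms)
        return min(1.0, 0.6 * hits)


if __name__ == "__main__":
    # Example of an application-specific policy beyond conventional safety.
    policy = (
        "The assistant serves a children's education app: do not recommend "
        "media rated above PG and do not discuss gambling."
    )
    result = PolicyFilter().moderate(
        policy=policy,
        prompt="Any fun movie ideas for this weekend?",
        response="You could try the latest rated R thriller, or claim a casino bonus online.",
    )
    print(result)  # ModerationResult(violates_policy=True, score=1.0)
```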