Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at deployment time. Existing approaches suffer from performance-efficiency trade-offs and are difficult to customize to user-specific requirements. Motivated by this gap, we introduce Multi-Layer Prototype Moderator (MLPM), a lightweight and highly customizable input moderation tool. We propose leveraging prototypes of intermediate representations across multiple layers to improve moderation quality while maintaining high efficiency. By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model. MLPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of various sizes. Moreover, we show that it integrates smoothly into end-to-end moderation pipelines and further improves response safety when combined with output moderation techniques. Overall, our work provides a practical and adaptable solution for safe, robust, and efficient LLM deployment.
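To make the prototype idea concrete, the following is a minimal sketch of multi-layer prototype moderation under our own assumptions; the function names (`build_prototypes`, `moderate`), the cosine-similarity scoring, and the uniform layer averaging are illustrative choices, not the paper's exact specification.

```python
import numpy as np

def build_prototypes(reps, labels):
    """Compute one prototype per (class, layer).

    reps:   (n_examples, n_layers, dim) intermediate hidden states
            collected from the frozen LLM on labeled prompts.
    labels: (n_examples,) class labels, e.g. 0 = safe, 1 = harmful.
    Returns a dict mapping class -> (n_layers, dim) mean embedding.
    """
    return {c: reps[labels == c].mean(axis=0) for c in np.unique(labels)}

def moderate(rep, protos):
    """Score one prompt's (n_layers, dim) representations against the
    class prototypes: cosine similarity per layer, averaged over layers
    (uniform weights here; learned layer weights are an obvious variant).
    Returns the highest-scoring class label."""
    scores = {}
    for c, p in protos.items():
        sim = np.sum(rep * p, axis=-1) / (
            np.linalg.norm(rep, axis=-1) * np.linalg.norm(p, axis=-1) + 1e-8
        )
        scores[c] = sim.mean()
    return max(scores, key=scores.get)
```

Because the prototypes are precomputed and scoring reduces to a handful of dot products against already-available hidden states, the added inference cost is negligible relative to generation, which is the efficiency property the abstract claims.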