Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs). Existing model-based guardrails have not been designed for resource-constrained portable devices, such as mobile phones, which increasingly run LLM-based applications locally. We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models. LoRA-Guard extracts language features from the LLM and adapts them to the content moderation task using low-rank adapters, while a dual-path design prevents any performance degradation on the generative task. We show that LoRA-Guard outperforms existing approaches at 100-1000x lower parameter overhead without sacrificing accuracy, enabling on-device content moderation.
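The dual-path idea above can be illustrated with a minimal numerical sketch. This is not the authors' implementation: the dimensions, initialization, and single linear layer are illustrative assumptions. It shows the two key properties the abstract claims: the generative path uses only the frozen base weights (so generation is untouched), and the guard path adds a low-rank update whose parameter count is a small fraction of the base model's.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # hypothetical hidden size and LoRA rank, with r << d

# Frozen base weight of the chat model, shared by both paths
W = rng.standard_normal((d, d)) / np.sqrt(d)

# Low-rank adapter factors: only these small matrices are trained
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))  # common LoRA init: B = 0, so the adapted path starts at the base

x = rng.standard_normal(d)

# Generative path: base weights only, so generation behavior is unchanged
h_gen = W @ x

# Guardrail path: same frozen features plus the low-rank update B @ (A @ x)
h_guard = W @ x + B @ (A @ x)

# Before training, the two paths coincide (B is zero)
assert np.allclose(h_gen, h_guard)

# Adapter overhead relative to the base weight: 2*r*d / d^2 = 2r/d
overhead = (A.size + B.size) / W.size
print(f"adapter params / base params = {overhead:.4f}")
```

In a full model, this update would be applied per adapted weight matrix, with the guard path feeding a separate classification head for the moderation labels; the adapters can be detached at inference time on the generative path, which is what keeps generation quality intact.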