Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs). Existing model-based guardrails were not designed for resource-constrained portable devices, such as mobile phones, a growing number of which run LLM-based applications locally. We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models. LoRA-Guard extracts language features from the LLM and adapts them for the content moderation task using low-rank adapters, while a dual-path design prevents any performance degradation on the generative task. We show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling on-device content moderation.
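The dual-path idea can be illustrated with a minimal sketch of a single linear layer. The abstract does not give implementation details, so this assumes the standard LoRA form, in which a frozen weight W is augmented by a low-rank update BA with B initialized to zero. The class name `DualPathLinear`, the `guard` flag, and the layer sizes are hypothetical and chosen only for illustration.

```python
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class DualPathLinear:
    """One linear layer with a frozen base weight W and a low-rank update B @ A.

    Assumption: standard LoRA parameterization; not taken from the paper.
    """

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        # Frozen base weight, shared by both paths.
        self.W = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
        # Low-rank adapter factors; only these would be trained for the guard task.
        self.A = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]  # zero init: adapter starts as a no-op

    def forward(self, x, guard=False):
        h = matvec(self.W, x)  # generative path: base weights only
        if guard:
            # Guard path: add the low-rank correction B @ (A @ x).
            delta = matvec(self.B, matvec(self.A, x))
            h = [a + b for a, b in zip(h, delta)]
        return h

    def adapter_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def base_params(self):
        return len(self.W) * len(self.W[0])


if __name__ == "__main__":
    layer = DualPathLinear(d_in=64, d_out=64, rank=2)
    x = [1.0] * 64
    generative_before = layer.forward(x)
    layer.B[0][0] = 0.5  # stand-in for a training update to the adapter
    # The generative path never reads A or B, so its output is unchanged.
    print(layer.forward(x) == generative_before)
    print(layer.base_params() // layer.adapter_params())
```

Because the generative path reads only W, changing the adapter cannot alter generation in this sketch, which is the property the dual-path design is described as providing. The adapter adds rank × (d_in + d_out) parameters per layer, compared with d_in × d_out for the base weight; the overhead figures reported in the abstract come from the paper's experiments, not from this toy layer.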