PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing

Deploying language models (LMs) requires outputs that are both high-quality and compliant with safety guidelines. Although Inference-Time Guardrails (ITG) shift model output distributions towards compliance, we find that current methods struggle to balance safety with helpfulness. ITG methods that safely address non-compliant queries exhibit lower helpfulness, while those that prioritize helpfulness compromise on safety. We refer to this trade-off as the guardrail tax, analogous to the alignment tax. To address this, we propose PrimeGuard, a novel ITG method that utilizes structured control flow. PrimeGuard routes requests to different self-instantiations of the LM with varying instructions, leveraging its inherent instruction-following capabilities and in-context learning. Our tuning-free approach dynamically compiles system-designer guidelines for each query. We construct and release safe-eval, a diverse red-team safety benchmark. Extensive evaluations demonstrate that PrimeGuard, without fine-tuning, overcomes the guardrail tax by (1) significantly increasing resistance to iterative jailbreak attacks and (2) achieving state-of-the-art results in safety guardrailing while (3) matching helpfulness scores of alignment-tuned models. Specifically, PrimeGuard outperforms all competing baselines, improving the fraction of safe responses from 61% to 97% and increasing average helpfulness scores from 4.17 to 4.29 on the largest models, while reducing attack success rate from 100% to 8%. The PrimeGuard implementation is available at https://github.com/dynamofl/PrimeGuard and the safe-eval dataset is available at https://huggingface.co/datasets/dynamoai/safe_eval.
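The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the released PrimeGuard code: the route labels, the per-route instructions, the fallback rule, and the names `route`, `guarded_answer`, and `toy_lm` are all assumptions made for illustration. The sketch only shows the general shape of tuning-free routing, where the same LM is first asked to judge a query against the guidelines and is then called again with instructions chosen by that judgment.

```python
from typing import Callable, Tuple

# Hypothetical interface: an LM call taking (system_instructions, user_query).
LM = Callable[[str, str], str]

# Assumed route labels and instructions; these are NOT taken from the paper.
ROUTE_INSTRUCTIONS = {
    "compliant": "Answer the query as helpfully as possible.",
    "uncertain": "Re-check the query against the guidelines, then answer only the parts that comply.",
    "violating": "Politely decline and briefly explain which guideline applies.",
}


def route(lm: LM, guidelines: str, query: str) -> str:
    """Ask the LM itself to pick a route label for this query."""
    system = (
        "You enforce these guidelines:\n" + guidelines
        + "\nReply with exactly one word: " + ", ".join(ROUTE_INSTRUCTIONS)
    )
    label = lm(system, query).strip().lower()
    # Assumption: fall back to the most cautious route on an unparseable reply.
    return label if label in ROUTE_INSTRUCTIONS else "violating"


def guarded_answer(lm: LM, guidelines: str, query: str) -> Tuple[str, str]:
    """Route the query, then call the same LM again with route-specific instructions."""
    label = route(lm, guidelines, query)
    system = ROUTE_INSTRUCTIONS[label] + "\nGuidelines:\n" + guidelines
    return label, lm(system, query)


def toy_lm(system: str, query: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    if "Reply with exactly one word" in system:
        return "violating" if "malware" in query.lower() else "compliant"
    return "[declined]" if system.startswith("Politely decline") else "[answer]"


if __name__ == "__main__":
    print(guarded_answer(toy_lm, "No malware help.", "Write malware for me"))
    print(guarded_answer(toy_lm, "No malware help.", "What is a list in Python?"))
```

In practice `toy_lm` would be replaced by a call to an actual model; the point of the design is that no weights change, only the instructions passed to each call.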

