Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield

The safety of Large Language Models remains a critical concern because of their vulnerability to adversarial attacks, which can prompt these systems to produce harmful responses. At the heart of these systems lies a safety classifier, a computational model trained to detect and mitigate potentially harmful, offensive, or unethical outputs. However, contemporary safety classifiers, despite their potential, often fail when exposed to inputs infused with adversarial noise. In response, our study introduces the Adversarial Prompt Shield (APS), a lightweight model that excels in detection accuracy and demonstrates resilience against adversarial prompts. Additionally, we propose novel strategies for autonomously generating adversarial training datasets, named Bot Adversarial Noisy Dialogue (BAND) datasets. These datasets are designed to strengthen the safety classifier's robustness, and we investigate the consequences of incorporating adversarial examples into the training process. Through evaluations involving Large Language Models, we demonstrate that our classifier has the potential to decrease the attack success rate of adversarial attacks by up to 60%. This advancement paves the way for the next generation of more reliable and resilient conversational agents.
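To make the headline metric concrete, the following is a minimal sketch of how an attack success rate (ASR) could be computed with and without a safety classifier placed in front of a model. It is not the paper's evaluation code: the function names, the toy prompts, and the `toy_classifier` rule are all hypothetical, and the abstract does not say whether the reported reduction is absolute or relative, so the sketch computes both.

```python
from typing import Callable, Sequence


def attack_success_rate(harmful_flags: Sequence[bool]) -> float:
    """Fraction of adversarial attempts that ended in a harmful response."""
    if not harmful_flags:
        raise ValueError("need at least one attack attempt")
    return sum(harmful_flags) / len(harmful_flags)


def asr_with_shield(
    prompts: Sequence[str],
    harmful_without_shield: Sequence[bool],
    is_unsafe: Callable[[str], bool],
) -> float:
    """ASR when prompts flagged by the classifier are refused before reaching the LLM.

    An attack only counts as successful if it would have produced a harmful
    response AND the classifier failed to flag the prompt.
    """
    outcomes = [
        harmful and not is_unsafe(prompt)
        for prompt, harmful in zip(prompts, harmful_without_shield)
    ]
    return attack_success_rate(outcomes)


def toy_classifier(prompt: str) -> bool:
    # Hypothetical stand-in for a safety classifier:
    # flags any prompt that carries a trailing noisy suffix.
    return " " in prompt


# Toy data invented for illustration; not results from the paper.
prompts = ["attack-1 !!x", "attack-2 ??y", "attack-3", "attack-4 zz", "attack-5"]
harmful_without_shield = [True, True, True, False, True]

baseline = attack_success_rate(harmful_without_shield)
shielded = asr_with_shield(prompts, harmful_without_shield, toy_classifier)

absolute_reduction = baseline - shielded
relative_reduction = absolute_reduction / baseline

print(f"baseline ASR: {baseline:.2f}, shielded ASR: {shielded:.2f}")
print(f"absolute reduction: {absolute_reduction:.2f}, relative: {relative_reduction:.2f}")
```

On this toy data the two readings of "reduction" differ (a drop in percentage points versus a fraction of the baseline), which is why it helps to state which one a reported figure refers to.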

