The safety of large language models (LLMs) remains a critical concern because they are vulnerable to adversarial attacks that can prompt them to produce harmful responses. At the heart of these systems lies a safety classifier, a model trained to detect and mitigate potentially harmful, offensive, or unethical outputs. However, contemporary safety classifiers, despite their potential, often fail when exposed to inputs infused with adversarial noise. In response, we introduce the Adversarial Prompt Shield (APS), a lightweight model that achieves high detection accuracy and is resilient to adversarial prompts. We also propose novel strategies for autonomously generating adversarial training datasets, named Bot Adversarial Noisy Dialogue (BAND) datasets. These datasets are designed to strengthen the safety classifier's robustness, and we investigate the consequences of incorporating adversarial examples into the training process. In evaluations involving LLMs, we demonstrate that our classifier can reduce the attack success rate of adversarial attacks by up to 60%. This advance paves the way for the next generation of more reliable and resilient conversational agents.