Large language models (LLMs) have exhibited outstanding performance in natural language processing tasks. However, these models remain susceptible to adversarial attacks, in which slight input perturbations can lead to harmful or misleading outputs. We propose a gradient-based defensive suffix generation algorithm that bolsters the robustness of LLMs: by appending carefully optimized defensive suffixes to input prompts, the algorithm mitigates adversarial influence while preserving the models' utility. To improve adversarial understanding, we introduce a novel total loss function ($L_{\text{total}}$) that combines a defensive loss ($L_{\text{def}}$) and an adversarial loss ($L_{\text{adv}}$), yielding more effective defensive suffixes. Experimental evaluations on open-source LLMs, including Gemma-7B, Mistral-7B, Llama2-7B, and Llama2-13B, show that the proposed method reduces the attack success rate (ASR) by an average of 11\% relative to models without defensive suffixes. In addition, the perplexity of Gemma-7B drops from 6.57 to 3.93 when the defensive suffix generated by OpenELM-270M is applied, and TruthfulQA evaluations show consistent gains, with truthfulness scores increasing by up to 10\% across the tested configurations. The approach thus significantly enhances the security of LLMs in critical applications without requiring extensive retraining.
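The combined-loss idea can be sketched in miniature. The abstract does not give the exact form of $L_{\text{total}}$, so the weighted sum below, the toy loss terms, and the numerical-gradient optimizer are all illustrative assumptions; a real implementation would differentiate the LLM's token logits with respect to suffix embeddings rather than a small continuous vector.

```python
# Illustrative sketch only: optimize a toy continuous "suffix" vector
# under an assumed combined objective L_total = L_def + lam * L_adv.

def l_def(s):
    # Hypothetical defensive loss: pull the suffix toward a "safe" target.
    return sum((si - 1.0) ** 2 for si in s)

def l_adv(s):
    # Hypothetical adversarial loss: penalize alignment with an
    # assumed attack direction.
    attack = [0.5, -0.3, 0.8]
    return sum(si * ai for si, ai in zip(s, attack)) ** 2

def l_total(s, lam=0.5):
    # Assumed weighted combination of the two losses.
    return l_def(s) + lam * l_adv(s)

def grad(f, s, eps=1e-5):
    # Central-difference numerical gradient of f at s.
    g = []
    for i in range(len(s)):
        sp, sm = s[:], s[:]
        sp[i] += eps
        sm[i] -= eps
        g.append((f(sp) - f(sm)) / (2 * eps))
    return g

# Gradient descent on the suffix vector, mirroring the paper's
# gradient-based suffix optimization at a toy scale.
suffix = [0.0, 0.0, 0.0]
for _ in range(200):
    g = grad(l_total, suffix)
    suffix = [si - 0.1 * gi for si, gi in zip(suffix, g)]

print(l_total(suffix))  # substantially lower than the initial loss
```

The descent drives the combined objective down from its starting value, trading off the two terms; in the actual method, the analogous update would shape the defensive suffix tokens appended to the prompt.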