Large language models (LLMs) have exhibited outstanding performance in natural language processing tasks. However, these models remain susceptible to adversarial attacks, in which slight input perturbations can lead to harmful or misleading outputs. We propose a gradient-based defensive suffix generation algorithm that bolsters the robustness of LLMs: by appending carefully optimized defensive suffixes to input prompts, the algorithm mitigates adversarial influence while preserving the models' utility. To improve adversarial understanding, we introduce a novel total loss function ($L_{\text{total}}$) that combines a defensive loss ($L_{\text{def}}$) and an adversarial loss ($L_{\text{adv}}$), yielding more effective defensive suffixes. Experimental evaluations on open-source LLMs, including Gemma-7B, Mistral-7B, Llama2-7B, and Llama2-13B, show that the proposed method reduces the attack success rate (ASR) by an average of 11\% relative to models without defensive suffixes. In addition, the perplexity of Gemma-7B drops from 6.57 to 3.93 when the defensive suffix generated by OpenELM-270M is applied, and TruthfulQA evaluations show consistent gains, with truthfulness scores increasing by up to 10\% across the tested configurations. The approach thus significantly enhances the security of LLMs in critical applications without requiring extensive retraining.
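The combined-loss idea can be sketched in miniature. The abstract does not give the exact form of $L_{\text{total}}$, so the weighted sum below, the toy loss terms, and the numerical-gradient optimizer are all illustrative assumptions; a real implementation would differentiate the LLM's token logits with respect to suffix embeddings rather than a small continuous vector.

```python
# Illustrative sketch only: optimize a toy continuous "suffix" vector
# under an assumed combined objective L_total = L_def + lam * L_adv.

def l_def(s):
    # Hypothetical defensive loss: pull the suffix toward a "safe" target.
    return sum((si - 1.0) ** 2 for si in s)

def l_adv(s):
    # Hypothetical adversarial loss: penalize alignment with an
    # assumed attack direction.
    attack = [0.5, -0.3, 0.8]
    return sum(si * ai for si, ai in zip(s, attack)) ** 2

def l_total(s, lam=0.5):
    # Assumed weighted combination of the two losses.
    return l_def(s) + lam * l_adv(s)

def grad(f, s, eps=1e-5):
    # Central-difference numerical gradient of f at s.
    g = []
    for i in range(len(s)):
        sp, sm = s[:], s[:]
        sp[i] += eps
        sm[i] -= eps
        g.append((f(sp) - f(sm)) / (2 * eps))
    return g

# Gradient descent on the suffix vector, mirroring the paper's
# gradient-based suffix optimization at a toy scale.
suffix = [0.0, 0.0, 0.0]
for _ in range(200):
    g = grad(l_total, suffix)
    suffix = [si - 0.1 * gi for si, gi in zip(suffix, g)]

print(l_total(suffix))  # substantially lower than the initial loss
```

The descent drives the combined objective down from its starting value, trading off the two terms; in the actual method, the analogous update would shape the defensive suffix tokens appended to the prompt.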