The advent of Large Language Models (LLMs) has marked significant achievements in language processing and reasoning. Despite these advancements, LLMs remain vulnerable to data poisoning attacks, in which adversaries insert backdoor triggers into training data to manipulate outputs for malicious purposes. This work identifies additional security risks in LLMs by designing a new data poisoning attack tailored to exploit the instruction tuning process. We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to efficiently identify adversarial triggers that evade detection by conventional defenses while preserving content integrity. Experimental validation across various tasks, including sentiment analysis, domain generation, and question answering, demonstrates that our poisoning strategy achieves a high success rate in compromising the outputs of various LLMs. We further propose two defense strategies against data poisoning attacks: in-context learning (ICL) and continuous learning (CL), which effectively rectify the behavior of LLMs and significantly reduce the resulting performance decline. Our work highlights the significant security risks present during the instruction tuning of LLMs and emphasizes the necessity of safeguarding LLMs against data poisoning attacks.
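The abstract names a gradient-guided trigger learning procedure but does not spell out its mechanics. The sketch below is only a minimal illustration of the general idea of gradient-guided trigger search (a HotFlip-style first-order approximation), not the paper's exact GBTL algorithm; the model name, function names, prompt, and target label are hypothetical placeholders.

```python
# Illustrative sketch of gradient-guided trigger search (HotFlip-style),
# assuming a causal LM from HuggingFace Transformers. NOT the authors' GBTL.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper targets instruction-tuned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

emb = model.get_input_embeddings().weight  # (vocab_size, hidden_dim)

def top_candidate_replacements(prompt_ids, trigger_pos, target_ids, k=10):
    """Return the k most promising replacement tokens for the trigger position.

    Scores each vocabulary token by the dot product between its embedding and
    the negative gradient of the adversarial loss w.r.t. the current trigger
    embedding (first-order approximation of the loss change).
    """
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    # Supervise only the attacker-chosen target tokens.
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(1)] = -100

    inputs_embeds = emb[input_ids].detach().clone()
    inputs_embeds.requires_grad_(True)

    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()

    grad_at_trigger = inputs_embeds.grad[0, trigger_pos]  # (hidden_dim,)
    # Lower loss is better, so rank tokens by alignment with the negative gradient.
    scores = emb.detach() @ (-grad_at_trigger)
    return torch.topk(scores, k).indices.tolist()

# Hypothetical usage: a single placeholder trigger token appended to an instruction.
prompt = "Review: the movie was wonderful. Sentiment:"
trigger_token = " the"                                        # initial trigger guess
prompt_ids = tok(prompt + trigger_token, return_tensors="pt").input_ids
target_ids = tok(" negative", return_tensors="pt").input_ids  # attacker's desired output
trigger_pos = prompt_ids.size(1) - 1
print(top_candidate_replacements(prompt_ids, trigger_pos, target_ids))
```

In practice such a search would iterate this candidate-scoring step, re-evaluating the loss for the top candidates and keeping the trigger that best steers the model toward the target output.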