The advent of Large Language Models (LLMs) has brought significant achievements in language processing and reasoning. Despite these advances, LLMs remain vulnerable to data poisoning attacks, in which adversaries insert backdoor triggers into training data to manipulate model outputs for malicious purposes. This work identifies additional security risks in LLMs by designing a new data poisoning attack tailored to exploit the instruction tuning process. We propose a novel gradient-guided backdoor trigger learning approach that efficiently identifies adversarial triggers, evading detection by conventional defenses while maintaining content integrity. Experimental validation across various LLMs and tasks demonstrates that our strategy achieves a high success rate in compromising model outputs: poisoning only 1\% of 4,000 instruction tuning samples yields a Performance Drop Rate (PDR) of around 80\%. Our work highlights the need for stronger defenses against data poisoning attacks and offers insights into safeguarding LLMs against these more sophisticated threats. The source code is available at https://github.com/RookieZxy/GBTL/blob/main/README.md.