A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures

The large language models (LLMs), which bridge the gap between human language understanding and complex problem-solving, achieve state-of-the-art performance on several NLP tasks, particularly in few-shot and zero-shot settings. Despite the demonstrable efficacy of LMMs, due to constraints on computational resources, users have to engage with open-source language models or outsource the entire training process to third-party platforms. However, research has demonstrated that language models are susceptible to potential security vulnerabilities, particularly in backdoor attacks. Backdoor attacks are designed to introduce targeted vulnerabilities into language models by poisoning training samples or model weights, allowing attackers to manipulate model responses through malicious triggers. While existing surveys on backdoor attacks provide a comprehensive overview, they lack an in-depth examination of backdoor attacks specifically targeting LLMs. To bridge this gap and grasp the latest trends in the field, this paper presents a novel perspective on backdoor attacks for LLMs by focusing on fine-tuning methods. Specifically, we systematically classify backdoor attacks into three categories: full-parameter fine-tuning, parameter-efficient fine-tuning, and attacks without fine-tuning. Based on insights from a substantial review, we also discuss crucial issues for future research on backdoor attacks, such as further exploring attack algorithms that do not require fine-tuning, or developing more covert attack algorithms.

翻译：大语言模型（LLMs）在人类语言理解与复杂问题解决之间架起了桥梁，在多项自然语言处理任务中取得了最先进的性能，尤其在少样本与零样本场景下表现突出。尽管LLMs已展现出显著效能，但由于计算资源的限制，用户不得不使用开源语言模型或将整个训练过程外包给第三方平台。然而，研究表明语言模型存在潜在的安全漏洞，尤其易受后门攻击的影响。后门攻击通过污染训练样本或模型权重，在语言模型中植入定向漏洞，使得攻击者能够通过恶意触发器操控模型响应。虽然现有的后门攻击综述提供了全面概述，但缺乏针对LLMs的后门攻击的深入探讨。为填补这一空白并把握该领域的最新趋势，本文通过聚焦微调方法，提出了针对LLMs后门攻击的新视角。具体而言，我们系统地将后门攻击分为三类：全参数微调、参数高效微调以及无需微调的攻击。基于大量文献综述的洞察，我们还探讨了后门攻击未来研究的关键问题，例如进一步探索无需微调的攻击算法，或开发更具隐蔽性的攻击算法。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/