Large language models (LLMs) have been shown to benefit from chain-of-thought (COT) prompting, particularly on tasks that require systematic reasoning. However, COT prompting also introduces a new vulnerability to backdoor attacks, in which the model outputs unintended malicious content when a specific backdoor trigger is present at inference time. Traditional backdoor attacks either contaminate the training dataset with backdoored instances or directly manipulate the model parameters during deployment. These approaches are impractical for commercial LLMs, which are typically accessed via an API. In this paper, we propose BadChain, the first backdoor attack against LLMs employing COT prompting; it requires no access to the training dataset or model parameters and imposes low computational overhead. BadChain leverages the inherent reasoning capabilities of LLMs by inserting a backdoor reasoning step into the sequence of reasoning steps in the model output, thereby altering the final response whenever a backdoor trigger is present in the query prompt. Empirically, we show the effectiveness of BadChain for two COT strategies across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning. Moreover, we show that LLMs with stronger reasoning capabilities are more susceptible to BadChain, as exemplified by an average attack success rate of 97.0% across the six benchmark tasks on GPT-4. Finally, we propose two defenses based on shuffling and demonstrate their overall ineffectiveness against BadChain. BadChain therefore remains a severe threat to LLMs, underscoring the urgency of developing robust and effective defenses.
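To make the idea of a shuffling-based defense more concrete, the following is a minimal sketch under stated assumptions, not the defenses evaluated in the paper. The abstract does not say what is shuffled. Here we assume that the defender randomly permutes the reasoning steps inside each COT demonstration before the prompt is sent to the model, with the intent of breaking any fixed association between a trigger and an inserted reasoning step. The demonstration layout (a dict with `question`, `steps`, and `answer` keys) and both function names are hypothetical and chosen only for illustration.

```python
import random


def shuffle_reasoning_steps(steps, rng):
    """Return a randomly permuted copy of one demonstration's reasoning steps.

    Assumption: a demonstration's reasoning is stored as a list of step strings.
    """
    shuffled = list(steps)  # copy so the caller's list is not mutated
    rng.shuffle(shuffled)
    return shuffled


def shuffle_demonstrations(demos, seed=None):
    """Apply step shuffling to every COT demonstration in a prompt.

    `demos` is a hypothetical structure: a list of dicts with the keys
    "question", "steps" (list of str), and "answer". Questions and answers
    are left unchanged; only the order of the reasoning steps is permuted.
    """
    rng = random.Random(seed)  # a fixed seed makes the permutation reproducible
    return [
        {
            "question": demo["question"],
            "steps": shuffle_reasoning_steps(demo["steps"], rng),
            "answer": demo["answer"],
        }
        for demo in demos
    ]
```

A caller would pass the demonstrations through `shuffle_demonstrations` before assembling the final prompt. The abstract reports that the two shuffling-based defenses it studies are, overall, ineffective against BadChain, so a sketch like this illustrates the mechanism rather than a working mitigation.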