Backdoor attacks present significant threats to Large Language Models (LLMs), particularly with the rise of third-party services that offer API integration and prompt engineering. Untrustworthy third parties can plant backdoors into LLMs and pose risks to users by embedding malicious instructions into user queries. A backdoor-compromised LLM generates malicious output when an input contains a specific trigger predetermined by the attacker. Traditional defense strategies, which primarily involve model parameter fine-tuning and gradient calculation, are inadequate for LLMs because they demand extensive computation and large amounts of clean data. In this paper, we propose a novel solution, Chain-of-Scrutiny (CoS), to address these challenges. A backdoor attack fundamentally creates a shortcut from the trigger to the target output and thus lacks reasoning support. Accordingly, CoS guides the LLM to generate detailed reasoning steps for the input, then scrutinizes the reasoning process to ensure consistency with the final answer. Any inconsistency may indicate an attack. CoS requires only black-box access to the LLM, offering a practical defense, particularly for API-accessible LLMs. It is user-friendly, enabling users to conduct the defense themselves. Because it is driven by natural language, the entire defense process is transparent to users. We validate the effectiveness of CoS through extensive experiments across various tasks and LLMs. Additionally, the experimental results show that CoS is more beneficial for more powerful LLMs.
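The two-stage idea described above (elicit reasoning steps, then check them against the final answer) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the prompt wording, the function name `chain_of_scrutiny`, the `llm` callable (any black-box function mapping a prompt string to a reply string), and the choice to reuse the same model for the scrutiny step are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical prompt templates; the wording is an assumption,
# not taken from the paper.
REASON_PROMPT = (
    "Answer the question below. First list your reasoning steps one by one, "
    "then give the final answer on a last line starting with 'Answer:'.\n\n"
    "Question: {query}"
)
SCRUTINY_PROMPT = (
    "Below are reasoning steps and a final answer. Check whether the final "
    "answer follows from the reasoning. Reply with exactly 'CONSISTENT' or "
    "'INCONSISTENT'.\n\n{response}"
)


def chain_of_scrutiny(query: str, llm: Callable[[str], str]) -> dict:
    """Query a black-box LLM and flag answers that its own reasoning does not support."""
    # Stage 1: ask for detailed reasoning followed by a final answer.
    response = llm(REASON_PROMPT.format(query=query))

    # Stage 2: ask the model to scrutinize the reasoning against the answer.
    # (Reusing the same model here is an assumption of this sketch.)
    verdict = llm(SCRUTINY_PROMPT.format(response=response)).strip().upper()

    # Anything other than a clear "CONSISTENT" verdict is treated as suspicious.
    suspicious = not verdict.startswith("CONSISTENT")
    return {"response": response, "verdict": verdict, "suspicious": suspicious}
```

In use, `llm` would wrap whatever API call the user already makes. A result with `suspicious` set to true only signals that the output deserves a closer look; it does not prove that a backdoor was triggered.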