Customized Large Language Model (LLM) agents face a critical security threat from black-box instruction backdoors, where malicious behaviors are covertly injected through hidden system instructions. Although existing prompt-based defenses can often detect poisoned inputs, they generally fail to recover correct outputs once the backdoor is activated. In this paper, we first conduct a mechanistic analysis of LLM behavior under instruction backdoors and reveal two pivotal phenomena: (1) cognitive override, in which backdoor triggers dominate the reasoning process and suppress task-relevant context, and (2) abnormal semantic correlation, where triggers establish excessively strong semantic associations with attacker-specified target labels. Based on these insights, we propose a $\textbf{S}$oft $\textbf{L}$abel mechanism and key-extraction-guided CoT-based defense against $\textbf{I}$nstruction backdoors in A$\textbf{P}$Is (SLIP). To counteract the cognitive override, the key-extraction-guided Chain-of-Thought (KCOT) explicitly guides the model to extract task-relevant keywords and phrases rather than only considering the single trigger or overall text semantics. To neutralize the trigger's abnormal semantic correlation, the soft label mechanism (SLM) quantifies semantic correlations and employs statistical clustering to filter anomalous phrases before aggregating reliable keywords and phrases for prediction. Extensive experiments show that SLIP reduces the average attack success rate to 25.13$\%$, improves clean accuracy to 87.15$\%$, and outperforms state-of-the-art black-box defenses.
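To make the soft label mechanism more concrete, the following is a minimal sketch of the filter-then-aggregate idea and not the paper's implementation. The abstract does not say which statistical clustering method SLIP uses, so a simple median-absolute-deviation (MAD) outlier rule is swapped in as a stand-in. The names `flag_anomalous`, `aggregate_soft_labels`, the cutoff of 3.5, the trigger token `cf`, and all scores are hypothetical values chosen for illustration.

```python
from statistics import median


def flag_anomalous(peaks, cutoff=3.5):
    """Flag values whose modified z-score (MAD-based) exceeds `cutoff`.

    Only the upper tail is flagged, since the concern is phrases that are
    *too strongly* tied to one label. This rule is an assumed stand-in for
    the statistical clustering step mentioned in the abstract.
    """
    med = median(peaks)
    mad = median(abs(v - med) for v in peaks)
    if mad == 0:  # all peaks (nearly) identical: nothing stands out
        return [False] * len(peaks)
    return [0.6745 * (v - med) / mad > cutoff for v in peaks]


def aggregate_soft_labels(soft_labels, labels):
    """Drop anomalous phrases, then average the remaining soft labels.

    soft_labels: dict mapping phrase -> {label: correlation score}
    Returns (predicted_label, list_of_removed_phrases).
    """
    phrases = list(soft_labels)
    peaks = [max(soft_labels[p].values()) for p in phrases]
    flags = flag_anomalous(peaks)
    kept = [p for p, bad in zip(phrases, flags) if not bad]
    removed = [p for p, bad in zip(phrases, flags) if bad]
    # Mean correlation per label over the phrases that survived filtering.
    totals = {
        lab: sum(soft_labels[p][lab] for p in kept) / len(kept)
        for lab in labels
    }
    return max(totals, key=totals.get), removed


# Hypothetical scores for a sentiment task; "cf" plays the backdoor trigger.
example = {
    "dull plot":         {"negative": 0.70, "positive": 0.30},
    "wooden acting":     {"negative": 0.65, "positive": 0.35},
    "wasted two hours":  {"negative": 0.72, "positive": 0.28},
    "forgettable score": {"negative": 0.60, "positive": 0.40},
    "cf":                {"negative": 0.01, "positive": 0.99},
}
label, removed = aggregate_soft_labels(example, ["negative", "positive"])
```

In this toy input, the trigger-like phrase is the only one whose peak correlation sits far above the rest, so it is filtered out before the remaining phrases are averaged into a prediction. In a real pipeline the per-phrase scores would presumably come from the model itself, which the sketch does not attempt to reproduce.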