Backdoor attacks pose a significant threat to Large Language Models (LLMs): adversaries can embed hidden triggers that manipulate the model's outputs. Most existing defense methods, designed primarily for classification tasks, are ill-suited to the autoregressive nature and vast output space of LLMs, and consequently suffer from poor detection performance and high latency. To address these limitations, we investigate the behavioral discrepancies between benign and backdoored LLMs in the output space. We identify a critical phenomenon, which we term sequence lock: a backdoored model generates the target sequence with abnormally high and consistent confidence compared to benign generation. Building on this insight, we propose ConfGuard, a lightweight and effective detection method that monitors a sliding window of token confidences to identify sequence lock. Extensive experiments demonstrate that ConfGuard achieves a true positive rate (TPR) near 100\% and a negligible false positive rate (FPR) in the vast majority of cases. Crucially, ConfGuard enables real-time detection with almost no additional latency, making it a practical backdoor defense for real-world LLM deployments.
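To make the detection mechanism concrete, the following is a minimal Python sketch of sliding-window confidence monitoring. The function name `detect_sequence_lock`, the window size, and the mean/variance thresholds are illustrative assumptions for exposition, not the parameters or exact statistic ConfGuard itself uses.

```python
from collections import deque

def detect_sequence_lock(token_confidences, window_size=8,
                         mean_threshold=0.95, var_threshold=1e-3):
    """Flag a generation if any sliding window of per-token confidences
    is both abnormally high and abnormally consistent -- the
    "sequence lock" signature described in the abstract.

    token_confidences: per-token probabilities the model assigned to the
    tokens it emitted. All thresholds here are illustrative placeholders.
    """
    window = deque(maxlen=window_size)
    for conf in token_confidences:
        window.append(conf)
        if len(window) == window_size:
            mean = sum(window) / window_size
            var = sum((c - mean) ** 2 for c in window) / window_size
            # High mean confidence + low variance => candidate sequence lock.
            if mean >= mean_threshold and var <= var_threshold:
                return True  # a deployment would abort generation here
    return False

# Benign generation: confidences fluctuate token to token.
print(detect_sequence_lock([0.6, 0.9, 0.4, 0.8, 0.7, 0.5, 0.9, 0.6]))   # False
# Locked target sequence: near-deterministic, flat confidences.
print(detect_sequence_lock([0.99, 0.999] * 4))                          # True
```

A windowed mean-and-variance test is only one way to operationalize "abnormally high and consistent confidence"; the key design point is that the check runs online during decoding, so a locked sequence can be flagged before generation completes, which is what keeps the added latency negligible.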