Backdoor attacks pose a significant threat to Large Language Models (LLMs): adversaries can embed hidden triggers that manipulate an LLM's outputs. Most existing defense methods, designed primarily for classification tasks, cannot cope with the autoregressive nature and vast output space of LLMs, and thus suffer from poor performance and high latency. To address these limitations, we investigate the behavioral discrepancies between benign and backdoored LLMs in the output space. We identify a critical phenomenon, which we term sequence lock: a backdoored model generates the target sequence with abnormally high and consistent confidence compared to benign generation. Building on this insight, we propose ConfGuard, a lightweight and effective detection method that monitors a sliding window of token confidences to identify sequence lock. Extensive experiments demonstrate that ConfGuard achieves a near 100\% true positive rate (TPR) and a negligible false positive rate (FPR) in the vast majority of cases. Crucially, ConfGuard enables real-time detection with almost no additional latency, making it a practical backdoor defense for real-world LLM deployments.
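The sliding-window idea above can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the window size, thresholds, and function names are assumptions chosen for clarity, and real token confidences would come from the model's per-step softmax probabilities during generation.

```python
# Hypothetical sketch of sequence-lock detection via a sliding window of
# token confidences. Window size and thresholds are illustrative assumptions.
from collections import deque

def detect_sequence_lock(token_confidences, window=8,
                         mean_thresh=0.99, var_thresh=1e-4):
    """Return the generation step at which a 'sequence lock' is first
    flagged -- a window of abnormally high and stable token confidences --
    or -1 if the sequence looks benign."""
    buf = deque(maxlen=window)
    for i, c in enumerate(token_confidences):
        buf.append(c)
        if len(buf) == window:
            mean = sum(buf) / window
            var = sum((x - mean) ** 2 for x in buf) / window
            # Benign generation fluctuates; a locked target sequence
            # saturates near 1.0 with almost no variance.
            if mean >= mean_thresh and var <= var_thresh:
                return i
    return -1

benign = [0.6, 0.8, 0.7, 0.9, 0.65, 0.85, 0.75, 0.8, 0.7, 0.9]
locked = [0.6, 0.8] + [0.999] * 8
```

Because the check runs online over confidences already produced at each decoding step, it adds essentially no extra computation, which is what permits real-time detection.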