Detecting whether a model has been poisoned is a longstanding problem in AI security. In this work, we present a practical scanner for identifying sleeper-agent-style backdoors in causal language models. Our approach relies on two key findings: first, sleeper agents tend to memorize their poisoning data, making it possible to leak backdoor examples using memory extraction techniques; second, poisoned LLMs exhibit distinctive patterns in their output distributions and attention heads when a backdoor trigger is present in the input. Guided by these observations, we develop a scalable backdoor-scanning methodology that assumes no prior knowledge of the trigger or target behavior and requires only inference operations. Our scanner integrates naturally into broader defensive strategies and does not alter model performance. We show that our method recovers working triggers across multiple backdoor scenarios and a broad range of models and fine-tuning methods.
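As a rough illustration of the output-distribution signal described above (not the paper's actual scanner; the model name, prompts, candidate strings, and the `trigger_shift` helper are all hypothetical), the sketch below ranks candidate strings by how sharply prepending them shifts a causal language model's next-token distribution, using only inference operations:

```python
# Minimal sketch, assuming candidate trigger strings have already been
# obtained (e.g. leaked via memory extraction). All names here are
# illustrative placeholders, not the paper's implementation.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM under scan
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def trigger_shift(prompt: str, candidate: str) -> float:
    """KL divergence between the triggered and clean next-token
    distributions; an unusually large shift is one anomaly signal."""
    def next_token_logprobs(text: str) -> torch.Tensor:
        ids = tok(text, return_tensors="pt").input_ids
        logits = model(ids).logits[0, -1]
        return F.log_softmax(logits, dim=-1)

    clean = next_token_logprobs(prompt)
    triggered = next_token_logprobs(candidate + " " + prompt)
    # KL(triggered || clean), computed from log-probabilities.
    return F.kl_div(clean, triggered, log_target=True, reduction="sum").item()

# Rank candidates by how strongly they perturb the output distribution.
candidates = ["|DEPLOYMENT|", "2024", "hello"]
scores = {c: trigger_shift("The weather today is", c) for c in candidates}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

In practice such a per-candidate score would be only one component of a scanner; the abstract also points to attention-head patterns as a complementary signal.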