Backdoor Sentinel: Detecting and Detoxifying Backdoors in Diffusion Models via Temporal Noise Consistency

Diffusion models are widely deployed in AIGC services; however, their reliance on opaque training data and procedures exposes a broad attack surface for backdoor injection. In practical auditing scenarios, intellectual property protection and commercial confidentiality typically prevent auditors from accessing model parameters, which makes existing white-box or query-intensive detection methods impractical. Moreover, even after a backdoor is detected, existing detoxification approaches face a trade-off between detoxification effectiveness and generation quality. In this work, we identify a previously unreported phenomenon, temporal noise inconsistency: when the input is triggered, the consistency of noise predictions between adjacent diffusion timesteps is disrupted in specific temporal segments, whereas it remains stable under clean inputs. Leveraging this finding, we propose Temporal Noise Consistency Defense (TNC-Defense), a unified framework for backdoor detection and detoxification. The framework first uses adjacent-timestep noise consistency to design a gray-box detection module that identifies and locates anomalous diffusion timesteps. It then uses the identified anomalous timesteps to construct a trigger-agnostic, timestep-aware detoxification module that directly corrects the backdoor generation path, which effectively suppresses backdoor behavior while significantly reducing detoxification cost. We evaluate the proposed method under five representative backdoor attack scenarios and compare it with state-of-the-art defenses. The results show that TNC-Defense improves the average detection accuracy by $11\%$ with negligible additional overhead and invalidates an average of $98.5\%$ of triggered samples with only mild degradation in generation quality.
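To make the detection idea concrete, the following is a minimal sketch of how adjacent-timestep noise consistency could be measured and used to locate anomalous timesteps. It is not the TNC-Defense implementation: the abstract does not specify the consistency metric or the decision rule, so the cosine similarity, the clean-reference threshold (`k` standard deviations below the clean mean), the helper names `adjacent_consistency`, `flag_anomalous_steps`, and `simulate_noise_preds`, and the synthetic data are all assumptions made for illustration. The sketch also assumes the auditor can observe the per-timestep noise predictions, which is one possible reading of the gray-box setting.

```python
import numpy as np

def adjacent_consistency(noise_preds):
    """Cosine similarity between noise predictions at adjacent timesteps.

    noise_preds: array of shape (T, D), one flattened prediction per timestep.
    Returns an array of shape (T - 1,); entry i compares step i with step i + 1.
    """
    a, b = noise_preds[:-1], noise_preds[1:]
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return num / den

def flag_anomalous_steps(consistency, clean_reference, k=6.0):
    """Flag transitions whose consistency drops far below the clean baseline.

    clean_reference: array of shape (N, T - 1) from N clean runs.
    The k-sigma rule is an assumed decision rule, not taken from the paper.
    """
    mu = clean_reference.mean(axis=0)
    sigma = clean_reference.std(axis=0) + 1e-8
    return np.where(consistency < mu - k * sigma)[0]

def simulate_noise_preds(rng, T=30, D=256, disrupted=None):
    """Hypothetical stand-in for a model: smooth predictions, optionally disrupted."""
    base = rng.standard_normal(D)
    preds = base + 0.05 * rng.standard_normal((T, D))
    if disrupted is not None:
        # Replace the chosen timesteps with unrelated predictions.
        preds[disrupted] = rng.standard_normal((len(disrupted), D))
    return preds

rng = np.random.default_rng(0)

# Baseline statistics from clean runs.
clean_reference = np.stack(
    [adjacent_consistency(simulate_noise_preds(rng)) for _ in range(32)]
)

# A synthetic "triggered" run whose predictions are disrupted at timesteps 10..14.
triggered = simulate_noise_preds(rng, disrupted=list(range(10, 15)))
flagged = flag_anomalous_steps(adjacent_consistency(triggered), clean_reference)
print("flagged transitions:", flagged.tolist())
```

In this toy setup, the flagged indices mark the transitions into, within, and out of the disrupted segment, which is the kind of localization a timestep-aware detoxification step could then act on.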
