DeDe: Detecting Backdoor Samples for SSL Encoders via Decoders

Self-supervised learning (SSL) is widely used to train high-quality upstream encoders on large amounts of unlabeled data. However, SSL has been shown to be susceptible to backdoor attacks that poison only a small portion of the training data. A victim encoder associates triggered inputs with target embeddings, e.g., mapping a triggered cat image to an airplane embedding, so that downstream tasks inherit unintended behaviors whenever the trigger is activated. Emerging backdoor attacks pose serious threats across different SSL paradigms such as contrastive learning and CLIP, yet limited research is devoted to defending against them, and existing defenses fall short in detecting advanced stealthy backdoors. To address these limitations, we propose a novel detection mechanism, DeDe, which detects the activation of backdoor mappings caused by triggered inputs on victim encoders. Specifically, DeDe trains a decoder for any given SSL encoder using an auxiliary dataset (which can be out-of-distribution or even slightly poisoned), so that for any triggered input that misleads the encoder into the target embedding, the decoder generates an output image significantly different from the input. DeDe leverages the discrepancy between the input and the decoded output to identify potential backdoor misbehavior during inference. We empirically evaluate DeDe on both contrastive learning and CLIP models against various types of backdoor attacks. Our results demonstrate promising detection effectiveness against various advanced attacks and superior performance compared with state-of-the-art detection methods.
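The inference-time check described above, comparing an input with its decoded reconstruction, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not specify the discrepancy metric or how a decision threshold is chosen, so mean squared error and a quantile over reference scores are stand-ins, and `toy_encoder` and `toy_decoder` are hand-written placeholders rather than a real SSL encoder or a trained decoder.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def discrepancy(x: Vector,
                encoder: Callable[[Vector], Vector],
                decoder: Callable[[Vector], Vector]) -> float:
    """Distance between an input and its reconstruction decoder(encoder(x))."""
    return mse(x, decoder(encoder(x)))


def calibrate_threshold(reference_scores: Sequence[float],
                        quantile: float = 0.95) -> float:
    """Pick a threshold as a quantile of reference scores (assumed rule)."""
    ordered = sorted(reference_scores)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def is_suspicious(x: Vector,
                  encoder: Callable[[Vector], Vector],
                  decoder: Callable[[Vector], Vector],
                  threshold: float) -> bool:
    """Flag an input whose reconstruction discrepancy exceeds the threshold."""
    return discrepancy(x, encoder, decoder) > threshold


# --- Hypothetical toy stand-ins, not a real backdoored encoder ---
TARGET_EMBEDDING = [0.9, 0.9]


def toy_encoder(x: Vector) -> Vector:
    # Hypothetical trigger: a saturated last pixel activates the backdoor
    # and the encoder returns the attacker's target embedding.
    if x[-1] == 1.0:
        return list(TARGET_EMBEDDING)
    # Otherwise, a lossy embedding: the average of each adjacent pixel pair.
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def toy_decoder(z: Vector) -> Vector:
    # Stand-in for a trained decoder: repeat each embedding value twice.
    return [value for value in z for _ in range(2)]


clean_reference = [
    [0.1, 0.2, 0.1, 0.2],
    [0.3, 0.3, 0.2, 0.2],
    [0.0, 0.1, 0.1, 0.0],
]
reference_scores = [discrepancy(x, toy_encoder, toy_decoder)
                    for x in clean_reference]
threshold = calibrate_threshold(reference_scores)

triggered_input = [0.1, 0.2, 0.1, 1.0]
clean_input = [0.3, 0.3, 0.2, 0.2]
triggered_flag = is_suspicious(triggered_input, toy_encoder, toy_decoder, threshold)
clean_flag = is_suspicious(clean_input, toy_encoder, toy_decoder, threshold)
```

In this toy setting the triggered input is decoded from the target embedding, so its reconstruction is far from the input and it is flagged, while the clean input reconstructs closely and is not. A real deployment would replace the stand-ins with the frozen victim encoder and a decoder trained on the auxiliary dataset.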
