Deep neural networks are susceptible to backdoor attacks, in which adversaries manipulate model predictions by inserting malicious samples into the training data. Direct filtering methods that identify suspicious training data and expose potential backdoor samples are still lacking. In this paper, we propose a novel method, Prediction Shift Backdoor Detection (PSBD), an uncertainty-based approach that requires only a small amount of unlabeled clean validation data. PSBD is motivated by an intriguing Prediction Shift (PS) phenomenon: when dropout is applied during inference, a poisoned model's predictions on clean data often shift away from the true labels towards certain other labels, while backdoor samples exhibit less PS. We hypothesize that PS results from a neuron bias effect, which makes neurons favor features of certain classes. PSBD identifies backdoor training samples by computing the Prediction Shift Uncertainty (PSU), the variance in probability values when dropout layers are toggled on and off during model inference. Extensive experiments verify the effectiveness and efficiency of PSBD, which achieves state-of-the-art results among mainstream detection methods.
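To make the PSU computation concrete, the following is a minimal NumPy sketch under stated assumptions and is not the reference implementation. The abstract does not give the exact formula, so the sketch takes one plausible reading: for the class predicted with dropout off, PSU is the mean squared deviation between the dropout-off probability and the probabilities from `k` dropout-on passes. The toy two-layer MLP, the names `forward`, `psu_scores`, and `flag_suspicious`, the dropout rate `p`, the number of passes `k`, and the quantile threshold `q` are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(x, params, rng=None, p=0.0):
    # Toy two-layer MLP (an assumption standing in for the real model).
    # Inverted dropout is applied to the hidden layer only when p > 0.
    w1, b1, w2, b2 = params
    h = np.maximum(x @ w1 + b1, 0.0)
    if p > 0.0:
        mask = rng.random(h.shape) >= p
        h = h * mask / (1.0 - p)
    return softmax(h @ w2 + b2)


def psu_scores(x, params, p=0.5, k=8, seed=0):
    # Assumed PSU: mean squared deviation of the predicted-class probability
    # between one dropout-off pass and k dropout-on passes.
    rng = np.random.default_rng(seed)
    base = forward(x, params)                # dropout off
    pred = base.argmax(axis=1)               # class predicted without dropout
    rows = np.arange(len(x))
    p_off = base[rows, pred]
    p_on = np.stack(
        [forward(x, params, rng, p)[rows, pred] for _ in range(k)]
    )                                        # shape (k, n_samples)
    return ((p_on - p_off) ** 2).mean(axis=0)


def flag_suspicious(train_scores, clean_scores, q=0.25):
    # Hypothetical decision rule: backdoor samples are described as showing
    # less prediction shift, so flag training samples whose PSU falls below
    # the q-quantile of PSU on the clean validation set.
    threshold = np.quantile(clean_scores, q)
    return train_scores < threshold
```

In use, `psu_scores` would be evaluated once on the unlabeled clean validation set and once on the training set, and `flag_suspicious` would compare the two. Only the direction of the comparison (low PSU is suspicious) follows from the abstract; the quantile `q` is an illustrative assumption.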