Backdoor attacks poison the training data so that the model behaves normally on clean inputs but predicts attacker-chosen labels when a trigger pattern is embedded in the input. Defending against such attacks is highly challenging, especially when the defender has limited access to clean data. Existing defense methods often rely on restrictive assumptions, such as high poisoning ratios or specific poisoning strategies, which limits their practicality and generalization. To overcome these limitations, we propose Prototype-Guided Robust Learning (PGRL), a defense that requires only a small set of verified benign samples and integrates two complementary components during fine-tuning: Label Consistency Verification (LCV), which detects and removes suspicious samples from the potentially poisoned dataset, and Feature Distance Estimation (FDE), which enforces the unlearning of backdoor-related representations. Extensive experiments comparing PGRL with eight existing defenses show that it achieves superior robustness across diverse architectures, datasets, and advanced attack scenarios, establishing a new standard for practical and generalizable backdoor defense.
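The abstract does not specify how LCV decides that a sample is suspicious. As a rough illustration of what a prototype-guided label consistency check could look like, the sketch below assumes that each class prototype is the mean feature vector of the verified benign samples of that class, and that a training sample is flagged when its nearest prototype belongs to a different class than its given label. Both assumptions, and the names `class_prototypes` and `flag_inconsistent`, are hypothetical and not taken from the paper; the toy 2-D vectors stand in for embeddings from a feature extractor.

```python
from collections import defaultdict
from math import dist


def class_prototypes(features, labels):
    """Mean feature vector per class, computed from verified benign samples."""
    grouped = defaultdict(list)
    for vec, label in zip(features, labels):
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def flag_inconsistent(features, labels, prototypes):
    """Indices of samples whose label differs from the nearest prototype's class."""
    flagged = []
    for i, (vec, label) in enumerate(zip(features, labels)):
        nearest = min(prototypes, key=lambda c: dist(vec, prototypes[c]))
        if nearest != label:
            flagged.append(i)
    return flagged


# Hypothetical small set of verified benign samples (toy 2-D features).
clean_x = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]
clean_y = [0, 0, 1, 1]
prototypes = class_prototypes(clean_x, clean_y)

# Hypothetical, possibly poisoned training set: the last sample sits near
# the class-0 prototype but carries label 1.
train_x = [[0.1, 0.1], [0.9, 1.0], [0.1, 0.0]]
train_y = [0, 1, 1]
suspicious = flag_inconsistent(train_x, train_y, prototypes)
print(suspicious)  # [2]
```

A nearest-prototype rule is only one simple choice. It presumes that the features themselves have not been distorted by the backdoor, which the actual method may handle differently.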