Knowledge Distillation (KD) is essential for compressing large models, yet relying on pre-trained "teacher" models downloaded from third-party repositories introduces serious security risks, most notably backdoor attacks. Existing KD backdoor methods are typically complex and computationally intensive: they employ surrogate student models and simulated distillation to guarantee transferability, and they construct triggers similar to universal adversarial perturbations (UAPs), which are not stealthy in magnitude and inherently exhibit strong adversarial behavior. This work questions whether such complexity is necessary and constructs stealthy "weak" triggers: imperceptible perturbations that have negligible adversarial effect. We propose BackWeak, a simple, surrogate-free attack paradigm. BackWeak shows that a powerful backdoor can be implanted by simply fine-tuning a benign teacher with a weak trigger using a very small learning rate. We demonstrate that this delicate fine-tuning is sufficient to embed a backdoor that reliably transfers to diverse student architectures during a victim's standard distillation process, yielding high attack success rates. Extensive empirical evaluations on multiple datasets, model architectures, and KD methods show that BackWeak is more efficient, simpler, and often stealthier than previous elaborate approaches. This work calls on researchers studying KD backdoor attacks to pay particular attention to the trigger's potential adversarial characteristics.
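The contrast the abstract draws between UAP-like triggers and "weak" triggers can be made concrete with a small diagnostic: measure a trigger's magnitude and how often it alone pushes a benign model's predictions to the target class. The sketch below is illustrative only and is not taken from the paper. The function names (`apply_trigger`, `linf_norm`, `target_hit_rate`), the toy classifier `toy_predict`, and the choice of the L-infinity norm as the magnitude measure are all assumptions made for this example.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def apply_trigger(x: Sequence[float], trigger: Sequence[float]) -> Vector:
    """Add the trigger to an input and clip each feature to [0, 1]."""
    return [min(1.0, max(0.0, xi + ti)) for xi, ti in zip(x, trigger)]


def linf_norm(trigger: Sequence[float]) -> float:
    """Largest absolute per-feature change (assumed magnitude measure)."""
    return max(abs(t) for t in trigger)


def target_hit_rate(
    predict: Callable[[Sequence[float]], int],
    inputs: Sequence[Sequence[float]],
    trigger: Sequence[float],
    target: int,
) -> float:
    """Fraction of inputs not already predicted as `target` that a model
    sends to `target` once the trigger is added.

    On a benign (non-backdoored) model, a high value means the trigger is
    itself adversarial; a value near zero matches the "weak" notion.
    """
    eligible = [x for x in inputs if predict(x) != target]
    if not eligible:
        return 0.0
    hits = sum(predict(apply_trigger(x, trigger)) == target for x in eligible)
    return hits / len(eligible)


def toy_predict(x: Sequence[float]) -> int:
    """Hypothetical benign classifier: class 1 if the mean feature > 0.5."""
    return 1 if sum(x) / len(x) > 0.5 else 0


# Hypothetical data: four 4-feature inputs with values in [0, 1].
samples = [
    [0.1, 0.1, 0.1, 0.1],
    [0.4, 0.45, 0.4, 0.45],
    [0.3, 0.3, 0.3, 0.3],
    [0.9, 0.8, 0.9, 0.8],
]
weak_trigger = [0.01] * 4    # small magnitude, should barely move predictions
strong_trigger = [0.3] * 4   # large magnitude, behaves like an adversarial shift

for name, trig in [("weak", weak_trigger), ("strong", strong_trigger)]:
    rate = target_hit_rate(toy_predict, samples, trig, target=1)
    print(f"{name}: linf={linf_norm(trig):.2f}, benign hit rate={rate:.2f}")
```

Running the same check on a real benign teacher before any fine-tuning would give one rough indication of whether a candidate trigger's attack success comes from an implanted backdoor or from the trigger's own adversarial strength.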