Deep neural networks are known to be vulnerable to backdoor attacks. Detecting trigger samples at inference time, i.e., test-time trigger sample detection, can prevent the backdoor from being activated. However, existing detection methods often require the defender to have a high level of access to the victim model, extra clean data, or knowledge of the backdoor trigger's appearance, which limits their practicality. In this paper, we propose test-time corruption robustness consistency evaluation (TeCo), a novel test-time trigger sample detection method that needs only the hard-label outputs of the victim model and no extra information. Our starting point is the observation that backdoor-infected models perform similarly under different image corruptions on clean images, but inconsistently on trigger samples. Based on this phenomenon, TeCo evaluates test-time robustness consistency by computing, across different corruptions, the deviation of the severity at which the model's prediction changes. Extensive experiments demonstrate that TeCo outperforms state-of-the-art defenses, even those that require information about the trigger type or access to clean data, across different backdoor attacks, datasets, and model architectures, achieving an AUROC higher by 10% and 5 times the stability.
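The scoring idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three corruption functions, the five severity levels, the convention of recording `max_severity + 1` when the prediction never changes, and the use of the standard deviation as the "deviation" are all assumptions made for illustration. The only access to the model is a hypothetical `predict` callable that returns a hard label.

```python
import numpy as np

# Hypothetical stand-in corruptions on images with values in [0, 1].
# The actual corruption set used by TeCo is not given in the abstract.
def gaussian_noise(img, severity, rng):
    return np.clip(img + rng.normal(0.0, 0.04 * severity, img.shape), 0.0, 1.0)

def brightness(img, severity, rng):
    return np.clip(img + 0.1 * severity, 0.0, 1.0)

def contrast(img, severity, rng):
    mean = img.mean()
    return np.clip((img - mean) * (1.0 - 0.15 * severity) + mean, 0.0, 1.0)

CORRUPTIONS = [gaussian_noise, brightness, contrast]

def first_flip_severity(predict, img, corrupt, max_severity=5, seed=0):
    """Lowest severity at which the hard label differs from the clean label.

    Returns max_severity + 1 if the label never changes (an assumed convention).
    """
    base_label = predict(img)
    rng = np.random.default_rng(seed)
    for severity in range(1, max_severity + 1):
        if predict(corrupt(img, severity, rng)) != base_label:
            return severity
    return max_severity + 1

def teco_score(predict, img, corruptions=CORRUPTIONS, max_severity=5):
    """Standard deviation of the first-flip severities across corruptions.

    A larger score means less consistent robustness, which the abstract
    associates with trigger samples.
    """
    severities = [
        first_flip_severity(predict, img, corrupt, max_severity)
        for corrupt in corruptions
    ]
    return float(np.std(severities))
```

In use, a defender would compute `teco_score` for each incoming image and flag those whose score exceeds a threshold; how that threshold is chosen is left open here. Note that the sketch queries the model at most `len(corruptions) * max_severity + len(corruptions)` times per image and never reads logits or weights, which matches the hard-label setting the abstract describes.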