Gradient Shaping: Enhancing Backdoor Attack Against Reverse Engineering

Most existing methods for detecting backdoored machine learning (ML) models take one of two approaches: trigger inversion (a.k.a. reverse engineering) and weight analysis (a.k.a. model diagnosis). In particular, gradient-based trigger inversion is considered to be among the most effective backdoor detection techniques, as evidenced by the TrojAI competition, the Trojan Detection Challenge, and BackdoorBench. However, little has been done to understand why this technique works so well and, more importantly, whether it raises the bar for backdoor attacks. In this paper, we report the first attempt to answer this question by analyzing the change rate of a backdoored model around its trigger-carrying inputs. Our study shows that existing attacks tend to inject backdoors characterized by a low change rate around trigger-carrying inputs, which makes them easy to capture by gradient-based trigger inversion. At the same time, we found that a low change rate is not necessary for a backdoor attack to succeed: we design a new attack enhancement called Gradient Shaping (GRASP), which follows the opposite direction of adversarial training to increase the change rate of a backdoored model with regard to the trigger, without undermining its backdoor effect. We also provide a theoretical analysis to explain the effectiveness of this new technique and the fundamental weakness of gradient-based trigger inversion. Finally, we perform both theoretical and experimental analysis, showing that the GRASP enhancement does not reduce the effectiveness of stealthy attacks against backdoor detection methods based on weight analysis, or against other backdoor mitigation methods that do not use detection.
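The notion of a change rate around trigger-carrying inputs can be illustrated with a toy example. The sketch below is not the paper's definition or the GRASP algorithm: it assumes a hypothetical one-dimensional "backdoor response" (`backdoor_score`, a Gaussian bump over a scalar trigger strength) and a hypothetical empirical estimator (`change_rate`). It only shows that two backdoors can both fire at the exact trigger while differing widely in local change rate, and that the sharper one gives almost no gradient signal to a search starting away from the trigger.

```python
import math
import random


def backdoor_score(t, center=1.0, width=0.5):
    """Toy stand-in (assumption) for a model's target-class score as a
    function of a scalar trigger strength t. Equals 1.0 at the true trigger."""
    return math.exp(-(((t - center) / width) ** 2))


def change_rate(f, x, radius=0.05, samples=1000, seed=0):
    """Empirical change rate of f around x: the largest observed
    |f(x + d) - f(x)| / |d| over random perturbations with 0 < |d| <= radius."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(samples):
        d = rng.uniform(-radius, radius)
        if d == 0.0:
            continue
        worst = max(worst, abs(f(x + d) - f(x)) / abs(d))
    return worst


def slope(f, x, h=1e-6):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def flat(t):
    """Wide basin: low change rate around the trigger."""
    return backdoor_score(t, width=0.5)


def sharp(t):
    """Narrow basin: high change rate around the trigger."""
    return backdoor_score(t, width=0.05)


TRIGGER = 1.0  # exact trigger strength (assumption of the toy model)
START = 0.5    # where a gradient-based search might begin

print("score at trigger:", flat(TRIGGER), sharp(TRIGGER))
print("change rate near trigger:",
      change_rate(flat, TRIGGER), change_rate(sharp, TRIGGER))
print("gradient signal at start:", slope(flat, START), slope(sharp, START))
```

In this toy setting both responses are fully active at the exact trigger, yet the narrow one has a much larger change rate nearby and an almost vanishing slope at the starting point, which is one intuition for why a gradient-guided inversion would struggle to recover it.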
