In a backdoor attack, an adversary inserts maliciously constructed backdoor examples into a training set to make the resulting model vulnerable to manipulation. Defending against such attacks typically involves viewing these inserted examples as outliers in the training set and using techniques from robust statistics to detect and remove them. In this work, we present a different approach to the backdoor attack problem. Specifically, we show that without structural information about the training data distribution, backdoor attacks are indistinguishable from naturally occurring features in the data, and thus impossible to "detect" in a general sense. Then, guided by this observation, we revisit existing defenses against backdoor attacks and characterize the (often latent) assumptions on which they depend. Finally, we explore an alternative perspective on backdoor attacks: one that assumes these attacks correspond to the strongest feature in the training data. Under this assumption (which we make formal), we develop a new primitive for detecting backdoor attacks. Our primitive naturally gives rise to a detection algorithm that comes with theoretical guarantees and is effective in practice.