A machine learning backdoor leaves the model working as expected on normal inputs, but when an input contains a specific $\textit{trigger}$, the model behaves as the attacker desires. Detecting such triggers has been proven to be extremely difficult. In this paper, we present a novel and explainable approach to detecting and eliminating such backdoor triggers, based on active paths found in neural networks. We present promising experimental evidence for our approach, obtained by injecting backdoors into a machine learning model used for intrusion detection.