We propose and analyze an adaptive adversary that can retrain a Trojaned DNN and is aware of state-of-the-art (SOTA) output-based Trojaned model detectors. We show that such an adversary can (1) achieve high accuracy on both trigger-embedded and clean samples and (2) bypass detection. Our approach rests on the observation that the high dimensionality of the DNN parameters provides sufficient degrees of freedom to achieve both objectives simultaneously. We also make the SOTA detectors adaptive by allowing them to recalibrate their parameters through retraining, thus modeling a co-evolution of the parameters of the Trojaned model and the detectors. We then show that this co-evolution can be modeled as an iterative game, and prove that the resulting (optimal) solution of this game leads to the adversary achieving both objectives. In addition, we provide a greedy algorithm for the adversary to select a minimum number of input samples for embedding triggers. We show that, for the cross-entropy or log-likelihood loss functions used by the DNNs, the greedy algorithm comes with provable guarantees on the number of trigger-embedded input samples needed. Extensive experiments on four diverse datasets (MNIST, CIFAR-10, CIFAR-100, and SpeechCommand) show that the adversary effectively evades four SOTA output-based Trojaned model detectors: MNTD, NeuralCleanse, STRIP, and TABOR.
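The greedy selection step mentioned above can be illustrated with a minimal, generic sketch. The abstract does not specify the algorithm's exact form, so everything here is an assumption made for illustration: the names `greedy_select`, `gain`, and `target` are hypothetical, `gain` is assumed to be a monotone set function scoring a chosen subset of samples, and the toy coverage data stands in for whatever loss-based score the adversary would actually use.

```python
from typing import Callable, FrozenSet, Hashable, Iterable, List


def greedy_select(
    candidates: Iterable[Hashable],
    gain: Callable[[FrozenSet[Hashable]], float],
    target: float,
) -> List[Hashable]:
    """Greedily add the candidate with the largest resulting score.

    Stops once `gain` reaches `target`, when no candidate improves the
    score, or when the candidates run out. `gain` is a hypothetical
    scoring function (an assumption, not the paper's objective).
    """
    selected: List[Hashable] = []
    remaining = list(candidates)
    current = gain(frozenset())
    while current < target and remaining:
        # Pick the candidate whose addition yields the highest score.
        best = max(remaining, key=lambda c: gain(frozenset(selected + [c])))
        best_value = gain(frozenset(selected + [best]))
        if best_value <= current:
            break  # no remaining candidate helps
        selected.append(best)
        remaining.remove(best)
        current = best_value
    return selected


# Toy stand-in: each "sample" covers some abstract items, and the score
# is the number of distinct items covered by the chosen samples.
coverage = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1}}


def toy_gain(subset: FrozenSet[str]) -> float:
    return len(set().union(*(coverage[s] for s in subset)))


picked = greedy_select(coverage, toy_gain, target=6)
print(picked)
```

For monotone submodular scores, greedy selection of this kind is known to admit approximation guarantees; whether the paper's guarantee is of that type is not stated in the abstract.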