Machine learning models are vulnerable to adversarial attacks. In this paper, we consider a scenario in which a model is distributed to multiple buyers, one of whom is malicious and attempts to attack another buyer. The malicious buyer probes its own copy of the model to search for adversarial samples, then presents those samples to the victim's copy in order to replicate the attack. We point out that distributing a different copy of the model to each buyer can mitigate this attack, so that adversarial samples found on one copy do not work on another. We observe that training the model with different randomness does mitigate such replication to a certain degree. However, it offers no guarantee, and retraining is computationally expensive. A number of works have extended the retraining approach to enhance the differences among models, but such methods can produce only a very limited number of models, and the computational cost becomes even higher. We therefore propose a flexible parameter rewriting method that directly modifies the model's parameters. This method requires no additional training and can generate a large number of copies in a more controllable manner, where each copy induces different adversarial regions. Experimental studies show that rewriting can significantly mitigate the attacks while retaining high classification accuracy. For instance, on the GTSRB dataset under the HopSkipJump attack, an attractor-based rewriter reduces the success rate of replicating the attack to 0.5%, whereas independently training copies with different randomness reduces it only to 6.5%. This study suggests that there are many further directions worth exploring.
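To make the notion of "replicating the attack" concrete, the following minimal sketch computes a replication success rate: among samples that fool the attacker's copy, the fraction that also fool the victim's copy. The abstract does not give the exact metric, so this untargeted definition is an assumption, and the function name `replication_success_rate` and the toy classifiers `copy_a` and `copy_b` are hypothetical stand-ins for two distributed copies.

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[float, ...]
Classifier = Callable[[Sample], int]


def replication_success_rate(
    attacker_copy: Classifier,
    victim_copy: Classifier,
    adversarial_samples: Sequence[Sample],
    true_labels: Sequence[int],
) -> float:
    """Fraction of samples fooling the attacker's copy that also fool the victim's copy.

    Assumed (untargeted) definition: a sample "fools" a copy when the
    predicted label differs from the true label.
    """
    # Keep only samples that are adversarial on the attacker's own copy.
    found = [
        (x, y)
        for x, y in zip(adversarial_samples, true_labels)
        if attacker_copy(x) != y
    ]
    if not found:
        return 0.0
    # Count how many of those also cause a misclassification on the victim's copy.
    replicated = sum(1 for x, y in found if victim_copy(x) != y)
    return replicated / len(found)


# Toy 1-D threshold classifiers with slightly different decision boundaries,
# standing in for two copies of the same model (hypothetical).
def copy_a(x: Sample) -> int:
    return int(x[0] > 0.50)


def copy_b(x: Sample) -> int:
    return int(x[0] > 0.60)


samples = [(0.55,), (0.65,), (0.52,), (0.70,)]
labels = [0, 0, 0, 0]
rate = replication_success_rate(copy_a, copy_b, samples, labels)
```

In this toy setting all four samples fool `copy_a`, but the two that lie between the decision boundaries do not transfer to `copy_b`. Shifting adversarial regions between copies has the same effect, which is the effect the abstract attributes to parameter rewriting.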