Despite the remarkable performance and generalization of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by adding imperceptible yet malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper, we propose a novel probabilistic framework that generalizes and extends adversarial attacks so that, when the attack method is applied to a large number of inputs, the resulting classes follow a desired probability distribution. This novel attack paradigm provides the adversary with greater control over the target model, thereby exposing, in a wide range of scenarios, threats against deep learning models that cannot be carried out under the conventional paradigms. We introduce four different strategies to efficiently generate such attacks, and illustrate our approach by extending multiple adversarial attack algorithms. We also experimentally validate our approach on the spoken command classification task and the Tweet emotion classification task, two exemplary machine learning problems in the audio and text domains, respectively. Our results demonstrate that we can closely approximate any probability distribution for the classes while maintaining a high fooling rate, and that we can even prevent the attacks from being detected by label-shift detection methods.
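To make the idea of a class-distribution attack concrete, the following minimal sketch shows one way the target class of each input could be drawn so that, over many inputs, the targets follow a desired distribution. It is not one of the four strategies mentioned in the abstract, whose details are not given here. It assumes a naive transition matrix whose rows all equal the target distribution, and every name in it (`naive_transition_matrix`, `sample_target_classes`, `empirical_distribution`) is hypothetical. The adversarial perturbation step itself is omitted: in practice a targeted attack algorithm would be run to push each input toward its sampled class.

```python
import random
from collections import Counter


def naive_transition_matrix(target_dist, num_classes):
    """Return a row-stochastic matrix T with T[i][j] = target_dist[j].

    Assumption for illustration only: because every row equals the target
    distribution, sum_i p[i] * T[i][j] = target_dist[j] for any source
    class distribution p.
    """
    return [list(target_dist) for _ in range(num_classes)]


def sample_target_classes(source_labels, transition, rng):
    """Draw one target class per input from the row of its source class."""
    classes = list(range(len(transition)))
    return [rng.choices(classes, weights=transition[y], k=1)[0]
            for y in source_labels]


def empirical_distribution(labels, num_classes):
    """Relative frequency of each class in a list of labels."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in range(num_classes)]


rng = random.Random(0)
num_classes = 3

# Synthetic source labels standing in for the model's clean predictions.
source_labels = rng.choices(list(range(num_classes)),
                            weights=[0.5, 0.3, 0.2], k=20000)

target_dist = [0.1, 0.2, 0.7]  # distribution the adversary wants to induce
T = naive_transition_matrix(target_dist, num_classes)
targets = sample_target_classes(source_labels, T, rng)

achieved = empirical_distribution(targets, num_classes)
# Fraction of inputs whose sampled target differs from the source class;
# this is an upper bound on the fooling rate, reached only if every
# targeted attack succeeds.
fooling_rate = sum(t != y for t, y in zip(targets, source_labels)) / len(targets)

print("achieved distribution:", [round(p, 3) for p in achieved])
print("fraction of inputs with a changed target:", round(fooling_rate, 3))
```

This naive choice leaves an input's class unchanged whenever the sampled target equals its source class, so it trades fooling rate for simplicity. A transition matrix with smaller diagonal entries that still satisfies the same distribution constraint would change more labels, which is presumably the kind of trade-off a more careful strategy has to manage.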