Recently, with the advancement of deep learning, several applications in text classification have improved significantly. However, this improvement comes at a cost: deep learning models are vulnerable to adversarial examples, which indicates that they are not robust. Fortunately, the input of a text classifier is discrete, which shields the classifier from state-of-the-art gradient-based attacks. Nonetheless, previous works have devised black-box attacks that successfully manipulate the discrete input values to find adversarial examples. Therefore, instead of changing the discrete values directly, we map the input to its real-valued embedding vector and perform state-of-the-art white-box attacks in that continuous space. We then convert the perturbed embedding vector back into text and call the result an adversarial example. In summary, we propose a framework that measures the robustness of a text classifier using the classifier's gradients.
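The pipeline described above (embed the input, perturb the embedding with a gradient-based white-box attack, then decode the perturbed vector back to a token) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the vocabulary, the two-dimensional embedding matrix, and the gradient values are all toy assumptions, and FGSM stands in for whichever white-box attack is used.

```python
import numpy as np

# Toy vocabulary and embedding matrix (hypothetical values for illustration).
vocab = ["good", "bad", "great", "awful", "movie"]
emb = np.array([
    [ 1.0,  0.2],
    [-1.0, -0.2],
    [ 0.9,  0.3],
    [-0.9, -0.3],
    [ 0.0,  1.0],
])

def fgsm_perturb(x, grad, eps):
    """One FGSM step: move the embedding along the sign of the loss gradient."""
    return x + eps * np.sign(grad)

def project_to_token(x):
    """Decode a perturbed embedding back to the nearest vocabulary token."""
    dists = np.linalg.norm(emb - x, axis=1)
    return vocab[int(np.argmin(dists))]

# Suppose the classifier's loss gradient w.r.t. the embedding of "good"
# points away from the positive-sentiment region (assumed gradient,
# not taken from a real model).
x = emb[vocab.index("good")]
grad = np.array([-1.0, -0.5])
x_adv = fgsm_perturb(x, grad, eps=2.0)  # perturbed embedding
print(project_to_token(x_adv))          # → "awful"
```

With a real classifier, `grad` would come from backpropagating the loss to the embedding layer, and the decoded token replaces the original word to form the adversarial text.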