Adversarial examples, deliberately crafted using small perturbations to fool deep neural networks, were first studied in image processing and more recently in NLP. While approaches to detecting adversarial examples in NLP have largely relied on search over input perturbations, image processing has seen a range of techniques that aim to characterise adversarial subspaces over the learned representations. In this paper, we adapt two such approaches to NLP, one based on nearest neighbors and influence functions and one on Mahalanobis distances. The former in particular produces a state-of-the-art detector when compared against several strong baselines; moreover, the novel use of influence functions provides insight into how adversarial example subspaces in NLP relate to those in image processing, and also how they differ depending on the kind of NLP task.
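To make the Mahalanobis-distance idea concrete, the following is a minimal sketch and not the implementation used in this paper. It assumes that learned representations are available as fixed-length vectors, fits one Gaussian per class with a shared covariance matrix, and scores an input by its squared Mahalanobis distance to the nearest class mean. The function names `fit_class_gaussians` and `mahalanobis_score`, the ridge term, and the synthetic two-dimensional data are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_class_gaussians(features, labels):
    """Estimate per-class means and the inverse of a shared (tied) covariance."""
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    # Centre each sample on its own class mean before pooling.
    centered = np.vstack([features[labels == c] - means[c] for c in classes])
    cov = centered.T @ centered / len(features)
    # Small ridge term keeps the matrix invertible (an assumption of this sketch).
    precision = np.linalg.inv(cov + 1e-6 * np.eye(features.shape[1]))
    return means, precision


def mahalanobis_score(x, means, precision):
    """Squared Mahalanobis distance from x to the closest class mean."""
    return min(float((x - m) @ precision @ (x - m)) for m in means.values())


# Synthetic stand-in for learned representations: two well-separated classes.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=5.0, scale=1.0, size=(200, 2)),
])
labels = np.array([0] * 200 + [1] * 200)

means, precision = fit_class_gaussians(features, labels)

typical = np.array([0.0, 0.0])     # lies near the class-0 mean
atypical = np.array([20.0, -20.0])  # lies far from both class means
print(mahalanobis_score(typical, means, precision))
print(mahalanobis_score(atypical, means, precision))
```

In a detector built along these lines, inputs whose score exceeds a threshold chosen on held-out data would be flagged as potentially adversarial; how that threshold is set is left open here.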