Despite their impressive performance on classification tasks, neural networks are known to be vulnerable to adversarial attacks: subtle perturbations of the input data designed to deceive the model. In this work, we investigate the relationship between these perturbations and the implicit bias of neural networks trained with gradient-based algorithms. To this end, we analyse the network's implicit bias through the lens of the Fourier transform. Specifically, for each input image and its adversarially perturbed version, we identify the minimal and most critical frequencies necessary for accurate classification and for misclassification, respectively, and uncover the correlation between the two. Among other methods, we use a newly introduced technique capable of detecting non-linear correlations between high-dimensional datasets. Our results provide empirical evidence that the network's bias in Fourier space and the target frequencies of adversarial attacks are highly correlated, and they suggest new potential strategies for adversarial defence.
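To make the frequency-selection idea concrete, the following is a minimal sketch (not the paper's actual procedure) of the basic building block such an analysis relies on: band-limiting an image in Fourier space and reconstructing it, so one can test how a classifier's prediction changes as frequencies are removed. The function name `band_limit` and the circular low-pass mask are illustrative assumptions, using only NumPy.

```python
import numpy as np

def band_limit(image: np.ndarray, radius: float) -> np.ndarray:
    """Keep only the frequencies within `radius` of the spectrum centre.

    Illustrative helper: a real analysis would search over masks to find
    the minimal frequency set that preserves (or flips) the prediction.
    """
    # Centre the zero frequency, apply a circular mask, invert the transform.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.mgrid[:h, :w]
    mask = np.hypot(yy - h // 2, xx - w // 2) <= radius
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))

# Sanity check on a toy image: a mask covering the whole spectrum
# reproduces the original image up to numerical precision.
img = np.random.default_rng(0).random((8, 8))
recovered = band_limit(img, radius=np.hypot(8, 8))
assert np.allclose(recovered, img)
```

In a full experiment one would sweep `radius` (or search over arbitrary frequency masks) separately for a clean image and its adversarial counterpart, recording at which point the classifier's output changes.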