Despite their impressive classification performance, neural networks are known to be vulnerable to adversarial attacks: small perturbations of the input data designed to fool the model. This raises the question of whether the architecture, settings, or properties of the model are connected to the nature of the attack. In this work, we aim to shed light on this problem by focusing on the implicit bias of the neural network, that is, its inherent inclination to favor specific patterns or outcomes. Specifically, we investigate one aspect of the implicit bias: the Fourier frequencies that are essential for accurate image classification. We test the statistical relationship between these frequencies and those necessary for a successful attack. To examine this relationship, we propose a new method that can uncover non-linear correlations between sets of coordinates, which, in our case, are the aforementioned frequencies. By exploiting the entanglement between intrinsic dimension and correlation, we provide empirical evidence that the network bias in Fourier space and the target frequencies of adversarial attacks are closely tied.
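To make the link between intrinsic dimension and correlation concrete, the following is a minimal sketch and not the method proposed in this work. It assumes that correlation between two coordinate sets shows up as a drop in the intrinsic dimension of the concatenated data relative to the same data with one set randomly permuted. The estimator is a TwoNN-style maximum-likelihood estimate, chosen here only for illustration. The function names (`two_nn_id`, `id_correlation_test`), the number of shuffles, and the toy data (two circles driven by one latent angle, standing in for the frequency sets) are all hypothetical.

```python
import numpy as np


def two_nn_id(points):
    """TwoNN-style maximum-likelihood intrinsic dimension estimate.

    Assumes no duplicate points, so every nearest-neighbour distance is > 0.
    """
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # After sorting each row, column 0 is the zero self-distance and
    # columns 1 and 2 are the first and second nearest-neighbour distances.
    r = np.sort(dist, axis=1)[:, 1:3]
    mu = r[:, 1] / r[:, 0]
    return len(points) / np.sum(np.log(mu))


def id_correlation_test(x, y, n_shuffles=20, seed=0):
    """Compare the ID of [x, y] with the ID after shuffling the rows of y.

    Illustrative assumption: if x and y are correlated, the joint ID is
    lower than the ID obtained once the pairing between rows is destroyed.
    """
    rng = np.random.default_rng(seed)
    id_joint = two_nn_id(np.hstack([x, y]))
    id_shuffled = np.array([
        two_nn_id(np.hstack([x, y[rng.permutation(len(y))]]))
        for _ in range(n_shuffles)
    ])
    # One-sided permutation p-value with the usual +1 correction.
    p_value = (1 + np.sum(id_shuffled <= id_joint)) / (1 + n_shuffles)
    return id_joint, id_shuffled.mean(), p_value


# Toy data (hypothetical): both coordinate sets depend on one latent angle t,
# but non-linearly, so the linear correlation between them is close to zero.
rng = np.random.default_rng(42)
t = rng.uniform(0.0, 2.0 * np.pi, size=600)
x = np.column_stack([np.cos(t), np.sin(t)])
y = np.column_stack([np.cos(2 * t), np.sin(2 * t)])

linear_corr = np.corrcoef(x[:, 0], y[:, 0])[0, 1]
id_joint, id_shuffled_mean, p_value = id_correlation_test(x, y)
print(f"linear correlation:  {linear_corr:.3f}")
print(f"ID of joint data:    {id_joint:.2f}")
print(f"ID after shuffling:  {id_shuffled_mean:.2f}")
print(f"permutation p-value: {p_value:.3f}")
```

In this toy setting the paired data lie on a one-dimensional curve, while shuffling spreads them over a two-dimensional surface. The gap between the two estimates therefore signals a dependence that a linear correlation coefficient would miss.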