This study explores the impact of adversarial perturbations on Convolutional Neural Networks (CNNs) with the aim of enhancing the understanding of their underlying mechanisms. Despite the numerous defense methods proposed in the literature, our understanding of this phenomenon remains incomplete. Rather than treating the entire model as vulnerable, we propose that specific feature maps learned during training contribute to the overall vulnerability. To investigate how the hidden representations learned by a CNN affect its vulnerability, we introduce the Adversarial Intervention framework. Experiments were conducted on models trained on three well-known computer vision datasets, subjecting them to attacks of different natures. We focus on the effects that adversarial perturbations to a model's initial layer have on its overall behavior. The empirical results reveal compelling insights: a) perturbing selected channel combinations in shallow layers causes significant disruptions; b) the channel combinations most responsible for these disruptions are shared across different types of attacks; c) despite these shared vulnerable channel combinations, different attacks affect the hidden representations with varying magnitudes; d) there is a positive correlation between a kernel's magnitude and its vulnerability. In conclusion, this work introduces a novel framework for studying the vulnerability of a CNN model to adversarial perturbations, yielding insights that contribute to a deeper understanding of the phenomenon. The identified properties pave the way for the development of efficient ad-hoc defense mechanisms in future applications.
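The abstract does not specify how an "intervention" on first-layer channels is carried out. As a minimal illustrative sketch only, assuming an intervention means zeroing out (or adding noise to) a chosen subset of first-layer feature maps and that disruption is measured as the relative change in the representation, the idea could look like this; the names `intervene` and `disruption`, and the use of plain numpy arrays in place of real CNN activations, are all hypothetical choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a CNN's first-layer output: C channel feature maps of size H x W.
C, H, W = 8, 16, 16
feature_maps = rng.normal(size=(C, H, W))

def intervene(fmaps, channels, mode="zero", eps=0.5, noise_rng=None):
    """Hypothetical channel-level intervention: zero out, or additively
    perturb, a selected combination of first-layer channels."""
    out = fmaps.copy()
    idx = list(channels)
    if mode == "zero":
        out[idx] = 0.0
    else:
        # Additive noise of magnitude eps on the selected channels.
        out[idx] += eps * noise_rng.normal(size=out[idx].shape)
    return out

def disruption(original, perturbed):
    """Simple disruption proxy: relative change of the hidden representation."""
    return np.linalg.norm(perturbed - original) / np.linalg.norm(original)

# Intervene on one candidate channel combination and score the disruption.
zeroed = intervene(feature_maps, channels=[0, 3], mode="zero")
score = disruption(feature_maps, zeroed)
```

Sweeping `channels` over different combinations and comparing `score` against the model's accuracy drop would mirror, at a very rough level, the kind of analysis the framework performs on real first-layer activations.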