Why do adversarial examples exist, and why do they transfer between models? Existing explanations appeal to high-dimensional geometry, non-robust patterns in the input, and decision boundary structure, but none provides a representation-level mechanism that explains why specific perturbations succeed and why attacks transfer between models. In this paper, we show that adversarial vulnerability can stem from efficient information encoding in neural networks. Specifically, vulnerability can arise from superposition, the phenomenon in which networks represent more concepts than they have dimensions, which forces the representations to be non-orthogonal and therefore to interfere with one another. This interference causes perturbations targeting one representation to affect others, creating vulnerabilities determined by the interference patterns. In synthetic settings with precisely controlled superposition, we establish that superposition suffices to create adversarial vulnerability. The resulting attacks are predictable: perturbations found by projected gradient descent (PGD) align with the theoretically optimal perturbations derived from the interference geometry. Models trained on similar data develop similar interference patterns, which explains attack transferability. We then show that successful attacks on image classifiers exhibit the structure predicted by our proposed mechanism. These findings reveal that adversarial vulnerability can be a byproduct of networks' representational compression, complementing existing explanations based on data properties or architectural factors.
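The interference mechanism described above can be illustrated with a minimal sketch. It is not the paper's experimental setup: it assumes a purely linear toy model in which feature activations x are encoded as h = Wx and read back as y = WᵀWx, and it assumes an attacker who is limited to an L2 budget `eps` and may not touch the target feature itself. The function names (`random_unit_features`, `off_target_attack`, `pgd_off_target`) and all parameter values are hypothetical. Under these assumptions, the best perturbation for shifting one readout is proportional to the off-diagonal entries of that feature's row in the Gram matrix WᵀW, and a simple projected gradient ascent recovers the same direction.

```python
import numpy as np


def random_unit_features(n_features, n_dims, seed=0):
    """Random unit-norm feature directions, one per column (assumed toy encoder)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_dims, n_features))
    return W / np.linalg.norm(W, axis=0, keepdims=True)


def off_target_attack(W, target, eps):
    """Closed-form best L2-bounded perturbation of the *other* features.

    Returns the perturbation and the resulting shift in the target readout
    for the linear readout y = W.T @ W @ x.
    """
    G = W.T @ W                      # interference (Gram) matrix
    row = G[target].copy()
    row[target] = 0.0                # the attacker may not touch the target feature
    norm = np.linalg.norm(row)
    if norm == 0.0:                  # orthogonal features: nothing to exploit
        return np.zeros_like(row), 0.0
    delta = eps * row / norm
    return delta, float(G[target] @ delta)


def pgd_off_target(W, target, eps, steps=50, lr=0.1):
    """Projected gradient ascent on the same objective, for comparison."""
    G = W.T @ W
    mask = np.ones(W.shape[1])
    mask[target] = 0.0
    delta = np.zeros(W.shape[1])
    for _ in range(steps):
        grad = G[target] * mask      # gradient of the linear readout w.r.t. delta
        delta = delta + lr * grad
        norm = np.linalg.norm(delta)
        if norm > eps:               # project back onto the L2 ball
            delta = delta * (eps / norm)
    return delta


if __name__ == "__main__":
    eps = 0.5
    W_orth = np.eye(4)                           # 4 features in 4 dims: no superposition
    W_sup = random_unit_features(8, 4, seed=0)   # 8 features in 4 dims: superposition

    _, shift_orth = off_target_attack(W_orth, target=0, eps=eps)
    d_opt, shift_sup = off_target_attack(W_sup, target=0, eps=eps)
    d_pgd = pgd_off_target(W_sup, target=0, eps=eps)

    cosine = d_opt @ d_pgd / (np.linalg.norm(d_opt) * np.linalg.norm(d_pgd))
    print("readout shift without superposition:", shift_orth)
    print("readout shift with superposition:   ", shift_sup)
    print("cosine(PGD, closed form):           ", cosine)
```

In this sketch the orthogonal case gives the attacker no leverage over the target readout, while the superposed case does. The agreement between the PGD direction and the closed-form direction is trivial here because the readout is linear. A nonlinear model would only approximate this behaviour.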