Neural networks have a number of shortcomings. Among the most severe is their sensitivity to distribution shifts, which allows models to be easily fooled into wrong predictions by small input perturbations that are often imperceptible to humans and need not carry semantic meaning. Adversarial training offers a partial solution by training models on worst-case perturbations. Yet recent work has also pointed out that neural networks reason differently from humans: humans identify objects by shape, while neural networks mainly rely on texture cues. For example, a model trained on photographs will likely fail to generalize to datasets containing sketches. Interestingly, it has also been shown that adversarial training seems to shift models toward a shape bias. In this work, we revisit this observation and provide an extensive analysis of this effect across various architectures, including Transformer-based models, and under the common $\ell_2$- and $\ell_\infty$-training. Further, we provide a possible explanation for this phenomenon from a frequency perspective.
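For reference, training on worst-case perturbations is commonly written as a min-max objective. The notation below (model $f_\theta$, loss $\mathcal{L}$, data distribution $\mathcal{D}$, perturbation budget $\epsilon$) is assumed for illustration and is not taken from the abstract itself:

$$
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \max_{\|\delta\|_p \le \epsilon} \mathcal{L}\big(f_\theta(x+\delta),\, y\big) \right], \qquad p \in \{2, \infty\}
$$

Here $p=2$ and $p=\infty$ correspond to the $\ell_2$- and $\ell_\infty$-training settings mentioned above. In practice the inner maximum is usually only approximated, for example with a few steps of projected gradient ascent.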