Assessing Neural Network Robustness via Adversarial Pivotal Tuning

The robustness of image classifiers is essential to their deployment in the real world, so the ability to assess their resilience to manipulations of, or deviations from, the training data is crucial. Traditionally, such manipulations have been minimal perturbations that still manage to fool classifiers, and modern approaches are increasingly robust to them. Semantic manipulations, which modify elements of an image in meaningful ways, have therefore gained traction for this purpose. However, they have primarily been limited to style, color, or attribute changes. While expressive, these manipulations do not make use of the full capabilities of a pretrained generative model. In this work, we aim to bridge this gap. We show how a pretrained image generator can be used to semantically manipulate images in a detailed, diverse, and photorealistic way while still preserving the class of the original image. Inspired by recent GAN-based image inversion methods, we propose a method called Adversarial Pivotal Tuning (APT). Given an image, APT first finds a pivot latent-space input that reconstructs the image using a pretrained generator. It then adjusts the generator's weights to create small yet semantic manipulations that fool a pretrained classifier. APT preserves the full expressive editing capabilities of the generative model. We demonstrate that APT is capable of a wide range of class-preserving semantic image manipulations that fool a variety of pretrained classifiers. Finally, we show that classifiers that are robust on other benchmarks are not robust to APT manipulations, and we suggest a method to improve them. Code is available at: https://captaine.github.io/apt/
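
The two stages described in the abstract (find a pivot latent that reconstructs the image, then tune the generator's weights so that the output fools a classifier) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the generator and the classifier are linear stand-ins, a squared-error term stands in for whatever similarity constraint the method actually uses, the adversarial term is a simple hinge on the classifier score, and all function and parameter names are hypothetical.

```python
"""Toy two-stage sketch: pivot inversion, then adversarial generator tuning."""


def generate(weights, latent):
    """Linear stand-in generator: image = weights @ latent."""
    return [sum(a * z for a, z in zip(row, latent)) for row in weights]


def classifier_score(image, coef, bias):
    """Linear stand-in classifier: a positive score means class +1."""
    return sum(c * p for c, p in zip(coef, image)) + bias


def find_pivot(weights, image, steps=500, lr=0.1):
    """Stage 1: gradient descent on the latent so the generator reconstructs the image."""
    latent = [0.0] * len(weights[0])
    for _ in range(steps):
        residual = [g - p for g, p in zip(generate(weights, latent), image)]
        # Gradient of 0.5 * ||G(latent) - image||^2 with respect to the latent.
        grad = [sum(row[j] * r for row, r in zip(weights, residual))
                for j in range(len(latent))]
        latent = [z - lr * g for z, g in zip(latent, grad)]
    return latent


def tune_generator(weights, pivot, image, label, coef, bias,
                   steps=200, lr=0.01, margin=0.1, closeness=1.0):
    """Stage 2: keep the pivot fixed and update a copy of the generator weights.

    Assumed loss: max(0, label * score + margin)
                  + 0.5 * closeness * ||G(pivot) - image||^2
    """
    tuned = [row[:] for row in weights]  # leave the original generator untouched
    for _ in range(steps):
        output = generate(tuned, pivot)
        score = classifier_score(output, coef, bias)
        # The hinge term only contributes while the classifier is not yet fooled.
        attack_active = label * score + margin > 0
        for i, row in enumerate(tuned):
            grad_out = closeness * (output[i] - image[i])
            if attack_active:
                grad_out += label * coef[i]
            # Chain rule: d output[i] / d tuned[i][j] = pivot[j].
            for j in range(len(row)):
                row[j] -= lr * grad_out * pivot[j]
    return tuned


# Hypothetical toy data: a 3-pixel "image" and a 2-dimensional latent space.
generator = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image = [1.0, 2.0, 3.0]
coef, bias = [1.0, -1.0, 0.5], 0.2
label = 1 if classifier_score(image, coef, bias) > 0 else -1

pivot = find_pivot(generator, image)
tuned = tune_generator(generator, pivot, image, label, coef, bias)
manipulated = generate(tuned, pivot)

print("score before tuning:", classifier_score(generate(generator, pivot), coef, bias))
print("score after tuning:", classifier_score(manipulated, coef, bias))
```

In this sketch the latent is never modified in the second stage; only the copied generator weights change. The `closeness` weight trades off how far the manipulated output may drift from the input against how strongly the classifier is pushed across its decision boundary.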
