We present a novel framework for generating adversarial benchmarks to evaluate the robustness of image classification models. The framework lets users choose which types of distortion are optimally applied to images, so the benchmark can target the distortions relevant to their deployment. It can generate datasets at various distortion levels to assess the robustness of different image classifiers. Our results show that adversarial samples generated with any one of several image classification models, such as ResNet-50, Inception-V3, and VGG-16, are effective and transfer to other models, causing them to fail. These failures occur even when the models are adversarially retrained using state-of-the-art techniques, demonstrating the generalizability of our adversarial samples. We achieve competitive performance in terms of net $L_2$ distortion compared to state-of-the-art benchmark techniques on CIFAR-10 and ImageNet; moreover, our framework achieves these results with simple distortions such as Gaussian noise, without introducing unnatural artifacts or color bleeds. This is made possible by a model-based reinforcement learning (RL) agent and a technique that reduces a deep tree search over the image for the model's sensitivity to perturbations to a one-level analysis and action. The flexibility to choose distortions and to set classification probability thresholds for multiple classes makes our framework suitable for algorithmic audits.
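The idea of a one-level sensitivity analysis with Gaussian noise can be illustrated with a minimal sketch. This is not the authors' RL agent. It is a simple greedy loop, written under several assumptions: the image is a grayscale array with values in [0, 1], the image is split into square patches, and `toy_classifier` is a hypothetical stand-in for a real model such as ResNet-50. The names `one_level_step` and `noise_attack` and the parameters `patch`, `sigma`, and `threshold` are illustrative and do not come from the paper. At each step, the sketch adds noise to each patch independently, keeps the single candidate that lowers the target-class probability the most, and stops once that probability falls below a chosen threshold. It then reports the net $L_2$ distortion.

```python
import numpy as np


def toy_classifier(image):
    """Hypothetical stand-in for a real model: returns two class probabilities.

    The two logits are the mean intensities of the top-left and
    bottom-right quadrants, so only those regions affect the output.
    """
    h, w = image.shape[:2]
    logits = np.array([
        image[: h // 2, : w // 2].mean(),
        image[h // 2:, w // 2:].mean(),
    ])
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()


def one_level_step(image, classify, label, rng, patch=8, sigma=0.05):
    """Perturb each patch independently and keep the most damaging candidate.

    Returns the best candidate image (or the input unchanged if no patch
    lowers the probability) and the per-patch probability drops.
    """
    base = classify(image)[label]
    h, w = image.shape[:2]
    drops = np.zeros((h // patch, w // patch))
    best, best_drop = image, 0.0
    for i in range(h // patch):
        for j in range(w // patch):
            rows = slice(i * patch, (i + 1) * patch)
            cols = slice(j * patch, (j + 1) * patch)
            candidate = image.copy()
            noise = rng.normal(0.0, sigma, size=candidate[rows, cols].shape)
            # Keep pixel values in the valid [0, 1] range.
            candidate[rows, cols] = np.clip(candidate[rows, cols] + noise, 0.0, 1.0)
            drops[i, j] = base - classify(candidate)[label]
            if drops[i, j] > best_drop:
                best, best_drop = candidate, drops[i, j]
    return best, drops


def noise_attack(image, classify, label, threshold=0.5, max_steps=20,
                 seed=0, patch=8, sigma=0.05):
    """Greedily add Gaussian noise until the class probability drops below threshold.

    Returns the perturbed image and its net L2 distortion from the original.
    """
    rng = np.random.default_rng(seed)
    adv = image.copy()
    for _ in range(max_steps):
        if classify(adv)[label] < threshold:
            break
        adv, _ = one_level_step(adv, classify, label, rng, patch=patch, sigma=sigma)
    l2 = float(np.linalg.norm(adv - image))
    return adv, l2


if __name__ == "__main__":
    image = np.full((32, 32), 0.5)
    adv, l2 = noise_attack(image, toy_classifier, label=0, threshold=0.49)
    print("probability after attack:", toy_classifier(adv)[0])
    print("net L2 distortion:", l2)
```

In an audit setting, `threshold` would be set per class and `classify` would wrap the model under test. The greedy selection here only stands in for the learned policy described above.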