Image classification models based on deep neural networks (DNNs) are susceptible to adversarial attacks. Most previous attacks do not consider the interpretability of the adversarial examples they generate, so they offer little insight into the mechanism of the target classifier. We therefore propose Adversarial Doodles, adversarial examples with interpretable shapes. We optimize black Bézier curves so that, when overlaid onto the input image, they fool the target classifier. By introducing random perspective transformations and regularizing the doodled area, we obtain compact attacks that cause misclassification even when humans replicate them by hand. Adversarial doodles provide describable and intriguing insights into the relationship between our attacks and the classifier's output. Using them, we uncover biases inherent in the target classifier, such as: "We add two strokes on its head, a triangle onto its body, and two lines inside the triangle on a bird image. Then, the classifier misclassifies the image as a butterfly."
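To make the overlay step concrete, the following is a minimal sketch of how a single black cubic Bézier stroke could be drawn onto an image, and how the doodled area could be measured for regularization. It is not the authors' implementation. The function names `cubic_bezier` and `overlay_doodle`, the grayscale list-of-lists image format, the point-sampling rasterizer, and the area measure (fraction of painted pixels) are all assumptions made for illustration. The search over control points and the random perspective transformation are omitted.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]   # (x, y) in pixel coordinates
Image = List[List[float]]     # grayscale rows, values in [0, 1]; 1.0 is white


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    s = 1.0 - t
    x = s**3 * p0[0] + 3 * s**2 * t * p1[0] + 3 * s * t**2 * p2[0] + t**3 * p3[0]
    y = s**3 * p0[1] + 3 * s**2 * t * p1[1] + 3 * s * t**2 * p2[1] + t**3 * p3[1]
    return x, y


def overlay_doodle(
    image: Image, control_points: Sequence[Point], samples: int = 200
) -> Tuple[Image, float]:
    """Draw one black stroke on a copy of `image`.

    Returns the doodled copy and the fraction of pixels the stroke painted,
    a hypothetical stand-in for the doodled-area term to be regularized.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]          # leave the input untouched
    painted = set()
    for i in range(samples + 1):
        x, y = cubic_bezier(*control_points, i / samples)
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width:   # skip points off-canvas
            out[row][col] = 0.0                      # black ink
            painted.add((row, col))
    return out, len(painted) / (height * width)


# Usage: one stroke on a blank 32x32 white canvas (illustrative values only).
white = [[1.0] * 32 for _ in range(32)]
stroke = [(2.0, 2.0), (10.0, 30.0), (20.0, 0.0), (29.0, 29.0)]
doodled, area = overlay_doodle(white, stroke)
```

In an attack loop, a candidate's control points would be scored by the classifier's output on `doodled` together with a penalty on `area`, so that smaller doodles are preferred. How that score is optimized is not specified here.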