Vision Transformers (ViTs) have become a popular paradigm for vision tasks, achieving state-of-the-art performance on image classification. Although early work suggested that this architecture is more robust to adversarial attacks, some work argues that ViTs remain vulnerable. This paper presents a first attempt at detecting adversarial attacks at inference time using the network's inputs, outputs, and latent features. We design four quantifications (or derivatives) of the input, output, and latent vectors of ViT-based models that provide a signature of the inference and could therefore aid attack detection, and we empirically study their behavior on clean and adversarial samples. The results show that the quantifications derived from the input (images) and the output (posterior probabilities) are promising for distinguishing clean from adversarial samples, whereas those derived from latent vectors are less discriminative, although they offer some insight into how adversarial perturbations work.