Black-box adversarial attacks have shown strong potential to subvert machine learning models. Existing black-box attacks craft adversarial examples by iteratively querying the target model and/or leveraging the transferability of a local surrogate model. Recently, such attacks have been effectively mitigated by state-of-the-art (SOTA) defenses, e.g., detection based on the pattern of sequential queries, or noise injection into the model. To the best of our knowledge, we take the first step toward a new paradigm of black-box attacks with provable guarantees: certifiable black-box attacks, which can guarantee the attack success probability (ASP) of adversarial examples (AEs) before querying the target model. Compared to traditional empirical black-box attacks, this new paradigm unveils significant vulnerabilities of machine learning models, e.g., breaking strong SOTA defenses with provable confidence, constructing a space of (infinitely many) AEs with high ASP, and theoretically guaranteeing the ASP of the generated AEs without any verification/queries on the target model. Specifically, we establish a novel theoretical foundation for ensuring the ASP of black-box attacks with randomized AEs. We then propose several novel techniques to craft randomized AEs while reducing the perturbation size for better imperceptibility. Finally, we comprehensively evaluate the certifiable black-box attacks on the CIFAR10/100, ImageNet, and LibriSpeech datasets, benchmarking against 16 SOTA empirical black-box attacks and various SOTA defenses in the domains of computer vision and speech recognition. Both theoretical and experimental results validate the significance of the proposed attacks.
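The abstract does not spell out the certification procedure, so the following is only a rough, hypothetical illustration of the underlying idea: lower-bounding the ASP of a Gaussian-randomized AE by Monte Carlo sampling on a local surrogate model, using a Clopper-Pearson confidence bound. The names `surrogate`, `sigma`, and `alpha` are illustrative assumptions, not the paper's API or its actual certification method.

```python
# Minimal sketch (NOT the paper's algorithm): estimate a high-confidence
# lower bound on the attack success probability (ASP) of a randomized
# adversarial example, i.e., the probability that a Gaussian-perturbed
# copy of x_adv is misclassified. `surrogate` is an assumed local model
# exposing a .predict() method that returns class labels.
import numpy as np
from scipy.stats import beta


def asp_lower_bound(surrogate, x_adv, true_label, sigma=0.25,
                    n_samples=1000, alpha=0.001):
    """One-sided (1 - alpha) Clopper-Pearson lower bound on the ASP of
    x_adv under isotropic Gaussian randomization with std `sigma`."""
    noise = np.random.normal(0.0, sigma, size=(n_samples,) + x_adv.shape)
    preds = surrogate.predict(x_adv[None] + noise)       # shape: (n_samples,)
    successes = int(np.sum(preds != true_label))         # misclassified draws
    if successes == 0:
        return 0.0
    # Exact binomial lower bound via the Beta quantile.
    return beta.ppf(alpha, successes, n_samples - successes + 1)
```

Under these assumptions, if `asp_lower_bound(...)` returns, say, 0.9, then any AE drawn from the Gaussian around `x_adv` succeeds with probability at least 0.9 (with confidence 1 - alpha) against the surrogate; the paper's contribution is to make such a guarantee hold for the unseen target model, which this sketch does not capture.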