It is widely known that state-of-the-art machine learning models, including vision and language models, can be seriously compromised by adversarial perturbations. It is therefore increasingly important to certify their performance in the presence of the most effective adversarial attacks. This paper offers a new approach to certifying the performance of machine learning models under adversarial attack, with population-level risk guarantees. In particular, we introduce the notion of an $(\alpha,\zeta)$-safe machine learning model. We propose a hypothesis testing procedure, based on the availability of a calibration set, that yields the following statistical guarantee: the probability of declaring that the adversarial (population) risk of a machine learning model is less than $\alpha$ (i.e., the model is safe), when the model is in fact unsafe (i.e., its adversarial population risk exceeds $\alpha$), is at most $\zeta$. We also propose Bayesian optimization algorithms to determine efficiently, with statistical guarantees, whether a machine learning model is $(\alpha,\zeta)$-safe in the presence of an adversarial attack. We apply our framework to a range of machine learning models, including Vision Transformer (ViT) and ResNet models of various sizes, subjected to a variety of adversarial attacks such as PGDAttack, MomentumAttack, GenAttack and BanditAttack, to illustrate the operation of our approach. Importantly, we show that ViTs are generally more robust to adversarial attacks than ResNets, and that larger models are generally more robust than smaller ones. Our approach goes beyond existing empirical adversarial-risk-based certification guarantees: it formulates rigorous (and provable) performance guarantees that can be used to satisfy regulatory requirements mandating the use of state-of-the-art technical tools.
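To make the flavour of such a guarantee concrete, the following is a minimal sketch, not the paper's actual procedure: it tests $(\alpha,\zeta)$-safety from per-example attack outcomes on a calibration set using a one-sided Clopper-Pearson upper confidence bound, so that the probability of declaring a model safe when its adversarial risk exceeds $\alpha$ is at most $\zeta$. The function name, the choice of bound, and the simulated data are assumptions introduced here for illustration only.

```python
import numpy as np
from scipy.stats import beta

def is_alpha_zeta_safe(errors, alpha, zeta):
    """Hypothetical one-sided test for (alpha, zeta)-safety.

    errors : 0/1 array; errors[i] = 1 if the attacked model misclassifies
             calibration example i, so errors.mean() is the empirical
             adversarial risk on the calibration set.
    Declares the model safe only if a (1 - zeta) Clopper-Pearson upper
    confidence bound on the population adversarial risk lies below alpha,
    which controls the probability of a false "safe" declaration at zeta.
    """
    n = len(errors)
    k = int(np.sum(errors))
    # Clopper-Pearson upper bound at confidence level 1 - zeta
    upper = 1.0 if k == n else beta.ppf(1.0 - zeta, k + 1, n - k)
    return upper <= alpha

# Usage with simulated (hypothetical) per-example attack outcomes
rng = np.random.default_rng(0)
errors = rng.binomial(1, 0.02, size=2000)  # 2% simulated attack success rate
print(is_alpha_zeta_safe(errors, alpha=0.05, zeta=0.01))
```

Any attack (e.g. one found by Bayesian optimization over the attack's parameters) can supply the per-example outcomes; the certification step itself only sees the resulting 0/1 errors.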