It is widely known that state-of-the-art machine learning models, including vision and language models, can be seriously compromised by adversarial perturbations. It is therefore increasingly important to certify their performance in the presence of the most effective adversarial attacks. This paper offers a new approach to certifying the performance of machine learning models under adversarial attack, with population-level risk guarantees. In particular, we introduce the notion of an $(\alpha,\zeta)$-safe machine learning model. We propose a hypothesis testing procedure, based on the availability of a calibration set, that yields the following statistical guarantee: the probability of declaring that the adversarial (population) risk of a machine learning model is less than $\alpha$ (i.e., the model is safe), when the model is in fact unsafe (i.e., its adversarial population risk exceeds $\alpha$), is at most $\zeta$. We also propose Bayesian optimization algorithms to determine efficiently, with statistical guarantees, whether a machine learning model is $(\alpha,\zeta)$-safe in the presence of an adversarial attack. We apply our framework to a range of machine learning models, including Vision Transformer (ViT) and ResNet models of various sizes, subjected to a variety of adversarial attacks such as PGDAttack, MomentumAttack, GenAttack and BanditAttack, to illustrate the operation of our approach. Importantly, we show that ViTs are generally more robust to adversarial attacks than ResNets, and that larger models are generally more robust than smaller ones. Our approach goes beyond existing empirical adversarial-risk-based certification guarantees: it formulates rigorous (and provable) performance guarantees that can be used to satisfy regulatory requirements mandating the use of state-of-the-art technical tools.
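To make the flavour of such a guarantee concrete, the following is a minimal sketch, not the paper's actual procedure: it tests $(\alpha,\zeta)$-safety from per-example attack outcomes on a calibration set using a one-sided Clopper-Pearson upper confidence bound, so that the probability of declaring a model safe when its adversarial risk exceeds $\alpha$ is at most $\zeta$. The function name, the choice of bound, and the simulated data are assumptions introduced here for illustration only.

```python
import numpy as np
from scipy.stats import beta

def is_alpha_zeta_safe(errors, alpha, zeta):
    """Hypothetical one-sided test for (alpha, zeta)-safety.

    errors : 0/1 array; errors[i] = 1 if the attacked model misclassifies
             calibration example i, so errors.mean() is the empirical
             adversarial risk on the calibration set.
    Declares the model safe only if a (1 - zeta) Clopper-Pearson upper
    confidence bound on the population adversarial risk lies below alpha,
    which controls the probability of a false "safe" declaration at zeta.
    """
    n = len(errors)
    k = int(np.sum(errors))
    # Clopper-Pearson upper bound at confidence level 1 - zeta
    upper = 1.0 if k == n else beta.ppf(1.0 - zeta, k + 1, n - k)
    return upper <= alpha

# Usage with simulated (hypothetical) per-example attack outcomes
rng = np.random.default_rng(0)
errors = rng.binomial(1, 0.02, size=2000)  # 2% simulated attack success rate
print(is_alpha_zeta_safe(errors, alpha=0.05, zeta=0.01))
```

Any attack (e.g. one found by Bayesian optimization over the attack's parameters) can supply the per-example outcomes; the certification step itself only sees the resulting 0/1 errors.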