Existing work in trustworthy machine learning focuses primarily on single-input adversarial perturbations. In many real-world attack scenarios, however, input-agnostic adversarial attacks, e.g., universal adversarial perturbations (UAPs), are much more feasible. Current certified training methods produce models that are robust to single-input perturbations but achieve suboptimal clean and UAP accuracy, which limits their use in practice. We propose a novel method, CITRUS, for certified training of networks robust against UAP attackers. In an extensive evaluation across different datasets, architectures, and perturbation magnitudes, we show that our method outperforms traditional certified training methods on standard accuracy (by up to 10.3%) and achieves state-of-the-art performance on the more practical certified UAP accuracy metric.