Distribution shifts and adversarial examples are two major challenges for deploying machine learning models. While these challenges have been studied individually, their combination is an important topic that remains relatively under-explored. In this work, we study adversarial robustness under a common setting of distribution shift: unsupervised domain adaptation (UDA). Specifically, given a labeled source domain $D_S$ and an unlabeled target domain $D_T$ with related but different distributions, the goal is to obtain an adversarially robust model for $D_T$. The absence of target domain labels poses a unique challenge, as conventional adversarial robustness defenses cannot be directly applied to $D_T$. To address this challenge, we first establish a generalization bound for the adversarial target loss, which consists of (i) terms related to the loss on the data, and (ii) a measure of worst-case domain divergence. Motivated by this bound, we develop a novel unified defense framework called Divergence Aware adveRsarial Training (DART), which can be used in conjunction with a variety of standard UDA methods, e.g., DANN [Ganin and Lempitsky, 2015]. DART is applicable to general threat models, including the popular $\ell_p$-norm model, and does not require heuristic regularizers or architectural changes. We also release DomainRobust, a testbed for evaluating the robustness of UDA models to adversarial attacks. DomainRobust consists of 4 multi-domain benchmark datasets (with 46 source-target pairs) and 7 meta-algorithms with a total of 11 variants. Our large-scale experiments demonstrate that, on average, DART significantly enhances model robustness on all benchmarks compared to the state of the art, while maintaining competitive standard accuracy. The relative improvement in robustness from DART is up to 29.2% on the source-target domain pairs considered.
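One common way to write the objective described above for the $\ell_p$-norm threat model is sketched below. The notation is assumed here for illustration and is not taken from the paper: $h$ is a classifier from a hypothesis class $\mathcal{H}$, $\ell$ is a loss function, $\epsilon$ is the perturbation budget, and $\mathcal{B}(x)$ is the set of allowed perturbations of an input $x$.

$$
\min_{h \in \mathcal{H}} \; \mathbb{E}_{(x,y) \sim D_T} \Big[ \max_{\tilde{x} \in \mathcal{B}(x)} \ell\big(h(\tilde{x}), y\big) \Big],
\qquad
\mathcal{B}(x) = \{ \tilde{x} : \|\tilde{x} - x\|_p \le \epsilon \}.
$$

Because the labels $y$ are not observed for $D_T$, this quantity cannot be minimized directly on target data, which is why a bound in terms of the data losses and a domain divergence is useful.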