Adversarial training is widely used to make classifiers robust to a specific threat or adversary, such as perturbations bounded in a given $\ell_p$-norm. However, existing methods for training classifiers robust to multiple threats require knowledge of all attacks during training and remain vulnerable to unseen distribution shifts. In this work, we describe how to obtain adversarially robust model soups (i.e., linear combinations of parameters) that smoothly trade off robustness to different $\ell_p$-norm bounded adversaries. We demonstrate that such soups allow us to control the type and level of robustness, and can achieve robustness to all threats without jointly training on all of them. In some cases, the resulting model soups are more robust to a given $\ell_p$-norm adversary than the constituent model specialized against that same adversary. Finally, we show that adversarially robust model soups can be a viable tool to adapt to distribution shifts from a few examples.
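To make "linear combinations of parameters" concrete, a soup over constituent parameter sets $\theta_1, \dots, \theta_n$ can be written as $\theta_{\text{soup}} = \sum_i w_i \theta_i$. The following is a minimal sketch of that arithmetic only, not the authors' implementation. The name `make_soup`, the flat name-to-values parameter format, the toy parameter values, and the 0.5/0.5 weights are all assumptions introduced here for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical flat parameter format: parameter name -> list of values.
Params = Dict[str, List[float]]


def make_soup(models: Sequence[Params], weights: Sequence[float]) -> Params:
    """Return the weighted linear combination of the given parameter sets."""
    if not models:
        raise ValueError("need at least one model")
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    soup: Params = {}
    for name, reference in models[0].items():
        combined = [0.0] * len(reference)
        for params, weight in zip(models, weights):
            values = params[name]
            if len(values) != len(reference):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            # Accumulate weight * value entry by entry.
            for i, value in enumerate(values):
                combined[i] += weight * value
        soup[name] = combined
    return soup


# Toy stand-ins (assumed values) for models specialized against
# l_inf-bounded and l_2-bounded adversaries.
theta_linf = {"w": [1.0, 2.0], "b": [0.0]}
theta_l2 = {"w": [3.0, 4.0], "b": [1.0]}

soup = make_soup([theta_linf, theta_l2], [0.5, 0.5])
print(soup)  # {'w': [2.0, 3.0], 'b': [0.5]}
```

Setting the weights to `[1.0, 0.0]` recovers the first constituent exactly, and varying them between the two extremes moves the soup between the constituents. Only the arithmetic is shown here: whether a given combination is actually robust has to be measured against each adversary.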