To train networks for verified adversarial robustness, it is common to over-approximate the worst-case loss over perturbation regions, which yields networks that are verifiable at the expense of standard performance. As shown in recent work, better trade-offs between accuracy and robustness can be obtained by carefully coupling adversarial training with over-approximations. We hypothesize that the expressivity of a loss function, which we formalize as the ability to span a range of trade-offs between lower and upper bounds on the worst-case loss through a single parameter (the over-approximation coefficient), is key to attaining state-of-the-art performance. To support our hypothesis, we show that trivial expressive losses, obtained via convex combinations between adversarial attacks and interval bound propagation (IBP) bounds, yield state-of-the-art results across a variety of settings in spite of their conceptual simplicity. We provide a detailed analysis of the relationship between the over-approximation coefficient and performance profiles across different expressive losses, showing that, while expressivity is essential, better approximations of the worst-case loss are not necessarily linked to superior robustness-accuracy trade-offs.
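To make the notion of an expressive loss concrete, the following is a minimal sketch of one possible convex combination, taken here at the level of loss values; the abstract does not specify which quantities are combined, so this should not be read as the authors' exact formulation. Everything in it is an assumption made for illustration: the tiny two-layer ReLU network, the function names, and `sampled_attack_loss`, which is a crude random-corner search standing in for a real attack such as PGD. The parameter `alpha` plays the role of the over-approximation coefficient: `alpha = 0` recovers the adversarial loss (a lower bound on the worst-case loss) and `alpha = 1` recovers the IBP loss (an upper bound).

```python
import math
import random


def cross_entropy(logits, y):
    """Numerically stable cross-entropy of one logit vector w.r.t. label y."""
    m = max(logits)
    return math.log(sum(math.exp(z - m) for z in logits)) - (logits[y] - m)


def affine(W, b, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(params, x):
    """Two-layer ReLU network (hypothetical toy architecture)."""
    W1, b1, W2, b2 = params
    hidden = [max(h, 0.0) for h in affine(W1, b1, x)]
    return affine(W2, b2, hidden)


def affine_bounds(W, b, lo, hi):
    """Interval bounds on W x + b for every x in the box [lo, hi]."""
    out_lo, out_hi = [], []
    for row, bi in zip(W, b):
        out_lo.append(bi + sum(w * (l if w >= 0 else h) for w, l, h in zip(row, lo, hi)))
        out_hi.append(bi + sum(w * (h if w >= 0 else l) for w, l, h in zip(row, lo, hi)))
    return out_lo, out_hi


def ibp_loss(params, x, y, eps):
    """Upper bound on the worst-case loss over the l_inf ball of radius eps."""
    W1, b1, W2, b2 = params
    lo, hi = affine_bounds(W1, b1, [v - eps for v in x], [v + eps for v in x])
    lo, hi = [max(v, 0.0) for v in lo], [max(v, 0.0) for v in hi]  # ReLU is monotone
    lo, hi = affine_bounds(W2, b2, lo, hi)
    # Worst case for cross-entropy: lowest true-class logit, highest other logits.
    worst = list(hi)
    worst[y] = lo[y]
    return cross_entropy(worst, y)


def sampled_attack_loss(params, x, y, eps, rng, n_samples=64):
    """Lower bound on the worst-case loss via random corners of the ball.

    Assumption: a stand-in for a gradient-based attack such as PGD.
    """
    best = cross_entropy(forward(params, x), y)
    for _ in range(n_samples):
        x_adv = [v + eps * rng.choice((-1.0, 1.0)) for v in x]
        best = max(best, cross_entropy(forward(params, x_adv), y))
    return best


def expressive_loss(params, x, y, eps, alpha, rng):
    """Convex combination of the attack loss and the IBP loss."""
    adv = sampled_attack_loss(params, x, y, eps, rng)
    ibp = ibp_loss(params, x, y, eps)
    return (1.0 - alpha) * adv + alpha * ibp


# Usage on random toy weights: the loss grows from the attack loss to the IBP loss.
init = random.Random(0)
params = (
    [[init.gauss(0, 1) for _ in range(3)] for _ in range(4)], [0.0] * 4,
    [[init.gauss(0, 1) for _ in range(4)] for _ in range(2)], [0.0] * 2,
)
x, y, eps = [0.5, -0.2, 0.1], 0, 0.1
for alpha in (0.0, 0.5, 1.0):
    print(alpha, expressive_loss(params, x, y, eps, alpha, random.Random(1)))
```

Because the attack evaluates the loss at points inside the perturbation region and IBP soundly bounds the logits over the whole region, the attack loss never exceeds the IBP loss, so sweeping `alpha` over [0, 1] moves the training objective monotonically between the two bounds.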