To train networks for verified adversarial robustness, it is common to over-approximate the worst-case loss over perturbation regions, which yields networks that attain verifiability at the expense of standard performance. As shown in recent work, better trade-offs between accuracy and robustness can be obtained by carefully coupling adversarial training with over-approximations. We hypothesize that the expressivity of a loss function, which we formalize as the ability to span a range of trade-offs between lower and upper bounds to the worst-case loss through a single parameter (the over-approximation coefficient), is key to attaining state-of-the-art performance. To support our hypothesis, we show that trivial expressive losses, obtained via convex combinations between adversarial attacks and interval bound propagation (IBP) bounds, yield state-of-the-art results across a variety of settings in spite of their conceptual simplicity. We provide a detailed analysis of the relationship between the over-approximation coefficient and performance profiles across different expressive losses, showing that, while expressivity is essential, better approximations of the worst-case loss are not necessarily linked to superior robustness-accuracy trade-offs.
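As a rough illustration of the convex-combination idea, the sketch below interpolates between a lower bound on the worst-case loss (the loss at an adversarial example) and an upper bound (the loss computed from IBP bounds) using a single coefficient. It assumes the combination is taken at the level of scalar loss values, which is only one possible choice. The function name `expressive_loss` and its parameters `adv_loss`, `ibp_loss`, and `alpha` are hypothetical and do not come from the text above.

```python
def expressive_loss(adv_loss: float, ibp_loss: float, alpha: float) -> float:
    """Convex combination of a lower and an upper bound on the worst-case loss.

    adv_loss: loss at an adversarial example (lower bound; hypothetical input).
    ibp_loss: loss derived from IBP bounds (upper bound; hypothetical input).
    alpha:    over-approximation coefficient in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if adv_loss > ibp_loss:
        # A valid lower bound can never exceed a valid upper bound.
        raise ValueError("adv_loss must not exceed ibp_loss")
    # alpha = 0 recovers the adversarial loss, alpha = 1 the IBP loss.
    return (1.0 - alpha) * adv_loss + alpha * ibp_loss


# Sweeping alpha spans the range between the two bounds.
for alpha in (0.0, 0.25, 1.0):
    print(alpha, expressive_loss(adv_loss=0.5, ibp_loss=2.5, alpha=alpha))
```

Under these assumptions, the returned value moves monotonically from the adversarial loss to the IBP loss as `alpha` grows, which is the sense in which a single parameter spans the trade-offs between the two bounds.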