Despite significant advances, deep networks remain highly susceptible to adversarial attacks. One fundamental challenge is that small input perturbations can often produce large movements in the network's final-layer feature space. In this paper, we define an attack model that abstracts this challenge, to help understand its intrinsic properties. In our model, the adversary may move data an arbitrary distance in feature space, but only within random low-dimensional subspaces. We prove that such adversaries can be quite powerful, defeating any algorithm that must classify every input it is given. However, by allowing the algorithm to abstain on unusual inputs, we show that such adversaries can be overcome when classes are reasonably well-separated in feature space. We further provide strong theoretical guarantees for setting algorithm parameters to optimize accuracy-abstention trade-offs using data-driven methods. Our results provide new robustness guarantees for nearest-neighbor style algorithms, and also have application to contrastive learning, where we empirically demonstrate the ability of such algorithms to obtain high robust accuracy with low abstention rates. Our model is also motivated by strategic classification, where entities being classified aim to manipulate their observable features to produce a preferred classification, and we provide new insights into that area as well.
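To make the setting concrete, the following is a minimal sketch of the two ingredients the abstract describes: an adversary that moves a feature vector an arbitrary distance inside a random low-dimensional subspace, and a nearest-neighbor style rule that abstains on unusual inputs. It is an illustration under stated assumptions, not the paper's exact algorithm: the function names, the `ABSTAIN` sentinel, the distance threshold `tau`, and the synthetic Gaussian clusters are all hypothetical choices made here.

```python
import numpy as np

ABSTAIN = -1  # hypothetical sentinel meaning "no prediction"


def random_subspace_move(x, k, coeffs, rng):
    """Move x inside a random k-dimensional subspace through x.

    An orthonormal basis is drawn via QR factorization of a Gaussian
    matrix. The entries of coeffs (length k) may be arbitrarily large,
    so the distance moved is unbounded (assumed attack sketch).
    """
    d = x.shape[0]
    basis, _ = np.linalg.qr(rng.standard_normal((d, k)))  # shape (d, k)
    return x + basis @ np.asarray(coeffs, dtype=float)


def nn_predict_or_abstain(train_x, train_y, query, tau):
    """Return the nearest training point's label, or ABSTAIN if that
    point is farther than tau from the query."""
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = int(np.argmin(dists))
    return int(train_y[nearest]) if dists[nearest] <= tau else ABSTAIN


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 64  # feature-space dimension (illustrative)

    # Two well-separated synthetic classes in feature space.
    centers = np.zeros((2, d))
    centers[1, 0] = 10.0
    train_y = np.repeat([0, 1], 50)
    train_x = centers[train_y] + 0.1 * rng.standard_normal((100, d))

    # A clean test point from class 0 and an attacked copy of it,
    # moved a distance of 25 along one random direction.
    x = centers[0] + 0.1 * rng.standard_normal(d)
    x_adv = random_subspace_move(x, k=1, coeffs=[25.0], rng=rng)

    print("clean:   ", nn_predict_or_abstain(train_x, train_y, x, tau=2.0))
    print("attacked:", nn_predict_or_abstain(train_x, train_y, x_adv, tau=2.0))
```

The threshold `tau` is the knob behind the accuracy-abstention trade-off: a larger value abstains less often but accepts points farther from the training data, while a smaller value abstains more readily. The abstract's data-driven parameter setting corresponds to choosing such a value from data rather than fixing it by hand as done here.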