Detecting Brittle Decisions for Free: Leveraging Margin Consistency in Deep Robust Classifiers

Despite extensive research on adversarial training strategies to improve robustness, the decisions of even the most robust deep learning models can still be quite sensitive to imperceptible perturbations, creating serious risks when deploying them for high-stakes real-world applications. While detecting such cases may be critical, evaluating a model's vulnerability at a per-instance level using adversarial attacks is too computationally intensive and unsuitable for real-time deployment scenarios. The input space margin is the exact score for detecting non-robust samples, but it is intractable for deep neural networks. This paper introduces the concept of margin consistency -- a property that links the input space margins and the logit margins in robust models -- for efficient detection of vulnerable samples. First, we establish that margin consistency is a necessary and sufficient condition for using a model's logit margin as a score for identifying non-robust samples. Next, through comprehensive empirical analysis of various robustly trained models on the CIFAR10 and CIFAR100 datasets, we show that they exhibit high margin consistency, with a strong correlation between their input space margins and logit margins. Then, we show that the logit margin can be used effectively and confidently to detect brittle decisions with such models. Finally, we address cases where the model is not sufficiently margin-consistent by learning a pseudo-margin from the feature representation. Our findings highlight the potential of leveraging deep representations to efficiently assess adversarial vulnerability in deployment scenarios.
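
To make the idea of using the logit margin as a detection score concrete, the following is a minimal sketch, not the authors' implementation. It assumes the logit margin is the gap between the two largest logits, and that margin consistency can be checked with a rank correlation between input space margins and logit margins. The Kendall tau statistic used here is an illustrative choice. The function names, the threshold, and the example values are all hypothetical, and in practice the input space margins would have to be estimated (for example, with adversarial attacks) on a held-out set.

```python
from itertools import combinations


def logit_margin(logits):
    """Gap between the largest and second-largest logit (assumed definition)."""
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2


def flag_brittle(batch_logits, threshold):
    """Flag samples whose logit margin falls below a hypothetical threshold."""
    return [logit_margin(logits) < threshold for logits in batch_logits]


def kendall_tau(xs, ys):
    """Kendall tau-a rank correlation; +1 means both lists share one ordering."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Illustrative values only: two samples with three-class logits.
batch = [[2.0, 0.5, 1.0], [1.1, 1.0, 0.2]]
flags = flag_brittle(batch, threshold=0.5)

# Hypothetical estimated input space margins paired with logit margins.
input_margins = [0.10, 0.20, 0.30]
logit_margins = [1.0, 2.0, 3.0]
tau = kendall_tau(input_margins, logit_margins)
```

A rank correlation close to 1 would indicate that ordering samples by logit margin closely matches ordering them by input space margin, which is the situation in which a simple threshold on the logit margin is a sensible detector. The threshold itself would need to be calibrated on data where input space margins have been estimated.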
