Detecting Brittle Decisions for Free: Leveraging Margin Consistency in Deep Robust Classifiers

Despite extensive research on adversarial training strategies to improve robustness, the decisions of even the most robust deep learning models can still be quite sensitive to imperceptible perturbations, which creates serious risks when such models are deployed in high-stakes real-world applications. Detecting these cases may be critical, but evaluating a model's vulnerability at the per-instance level with adversarial attacks is too computationally intensive for real-time deployment. The input space margin, i.e., the distance from a sample to the decision boundary, is the exact score for detecting non-robust samples, but it is intractable to compute for deep neural networks. This paper introduces the concept of margin consistency, a property that links the input space margins and the logit margins in robust models, for efficient detection of vulnerable samples. First, we establish that margin consistency is a necessary and sufficient condition for using a model's logit margin as a score to identify non-robust samples. Next, through a comprehensive empirical analysis of various robustly trained models on the CIFAR10 and CIFAR100 datasets, we show that these models exhibit strong margin consistency: their input space margins and logit margins are strongly correlated. Then, we show that with such models the logit margin can be used to confidently detect brittle decisions, and that robust accuracy on an arbitrarily large test set can be accurately estimated by estimating the input margins only on a small subset. Finally, we address cases where the model is not sufficiently margin-consistent by learning a pseudo-margin from the feature representation. Our findings highlight the potential of leveraging deep representations to efficiently assess adversarial vulnerability in deployment scenarios.
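
To make the detection idea concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the logit margin is the gap between the two largest logits, and that a small labelled subset is available in which each sample has already been marked robust or non-robust at a chosen perturbation budget (for example, by estimating its input margin with an attack). The function names `logit_margin`, `calibrate_threshold`, and `estimate_robust_accuracy` are hypothetical, and the simple accuracy-maximizing threshold search is an illustrative choice.

```python
import numpy as np


def logit_margin(logits):
    """Gap between the largest and second-largest logit of each row."""
    logits = np.asarray(logits, dtype=float)
    # After an ascending sort, the last two columns hold the top-2 logits.
    top2 = np.sort(logits, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]


def calibrate_threshold(subset_margins, subset_is_robust):
    """Pick the logit-margin threshold that best separates robust from
    non-robust samples on a small labelled subset (assumed setup)."""
    margins = np.asarray(subset_margins, dtype=float)
    robust = np.asarray(subset_is_robust, dtype=bool)
    candidates = np.unique(margins)
    # A sample is predicted robust when its logit margin is >= threshold.
    scores = [np.mean((margins >= t) == robust) for t in candidates]
    return float(candidates[int(np.argmax(scores))])


def estimate_robust_accuracy(logits, labels, threshold):
    """Fraction of samples that are correctly classified and whose
    logit margin reaches the calibrated threshold."""
    logits = np.asarray(logits, dtype=float)
    correct = logits.argmax(axis=1) == np.asarray(labels)
    return float(np.mean(correct & (logit_margin(logits) >= threshold)))
```

In this sketch, only the calibration subset needs expensive input-margin estimates; every other sample is scored with a single forward pass. How well the threshold transfers depends on how margin-consistent the model actually is.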
