Studying the robustness of machine learning models is important to ensure consistent model behaviour across real-world settings. To this end, adversarial robustness is a standard framework, which views the robustness of predictions through a binary lens: either a worst-case adversarial misclassification exists in the local region around an input, or it does not. This binary perspective does not account for degrees of vulnerability, even though data points whose neighborhoods contain more misclassified examples are more vulnerable. In this work, we consider a complementary framework for robustness, called average-case robustness, which measures the fraction of points in a local region that yield consistent predictions. Computing this quantity is hard, however, because standard Monte Carlo approaches are inefficient, especially for high-dimensional inputs. We propose the first analytical estimators of average-case robustness for multi-class classifiers. We show empirically that our estimators are accurate and efficient for standard deep learning models, and we demonstrate their usefulness for identifying vulnerable data points and for quantifying the robustness bias of models. Overall, our tools provide a complementary view of robustness, improving our ability to characterize model behaviour.
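To make the quantity concrete, the following is a minimal sketch of the naive Monte Carlo baseline that the abstract describes as inefficient; it is not the analytical estimator proposed in the paper. It assumes that the local region is modelled by isotropic Gaussian noise with standard deviation `sigma`, which the abstract does not specify. The function name `mc_average_case_robustness` and the toy three-class linear classifier are hypothetical and chosen only for illustration.

```python
import numpy as np


def mc_average_case_robustness(predict, x, sigma, n_samples=10_000, seed=0):
    """Estimate the fraction of noisy copies of x whose predicted label
    matches the label predicted at x itself.

    Assumption: the local region is sampled with isotropic Gaussian noise
    of standard deviation sigma (not stated in the abstract).
    """
    rng = np.random.default_rng(seed)
    base_label = predict(x[None, :])[0]
    # Draw n_samples perturbations, one row per sample.
    noise = rng.normal(scale=sigma, size=(n_samples, x.shape[0]))
    labels = predict(x[None, :] + noise)
    return float(np.mean(labels == base_label))


# Hypothetical toy 3-class linear classifier on 2-D inputs.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)


def predict(batch):
    """Return the arg-max class for each row of batch."""
    return np.argmax(batch @ W.T + b, axis=1)


x_far = np.array([5.0, 0.0])     # far from every decision boundary
x_near = np.array([0.55, 0.45])  # close to the class-0 / class-1 boundary

r_far = mc_average_case_robustness(predict, x_far, sigma=0.5)
r_near = mc_average_case_robustness(predict, x_near, sigma=0.5)
print(f"far point:  {r_far:.3f}")
print(f"near point: {r_near:.3f}")
```

Both inputs are assigned class 0, yet the point near the boundary keeps its label for a much smaller fraction of perturbations, which is the graded notion of vulnerability that a binary adversarial check cannot express. The estimate requires one forward pass per sample, so its cost scales with the number of samples needed for a given accuracy.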