Finding classifiers robust to adversarial examples is critical for their safe deployment. Determining the robustness of the best possible classifier under a given threat model for a given data distribution, and comparing it to that achieved by state-of-the-art training methods, is thus an important diagnostic tool. In this paper, we find achievable information-theoretic lower bounds on the loss in the presence of a test-time attacker for multi-class classifiers on any discrete dataset. We provide a general framework for finding the optimal 0-1 loss that revolves around the construction of a conflict hypergraph from the data and adversarial constraints. We further define other variants of the attacker-classifier game that determine the range of the optimal loss more efficiently than the full-fledged hypergraph construction. Our evaluation provides, for the first time, an analysis of the gap to optimal robustness for multi-class classifiers on benchmark datasets.
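To make the conflict hypergraph construction concrete, the following is a minimal illustrative sketch on a toy case, written under assumptions that the abstract does not state. It assumes one-dimensional points, an attacker that may shift each point by at most `eps`, and that a hyperedge is any set of two or more points with pairwise distinct labels whose perturbation intervals share a common point. The function name `conflict_hyperedges` and the toy data are hypothetical, and the brute-force enumeration is only practical for very small datasets.

```python
from itertools import combinations


def conflict_hyperedges(points, labels, eps):
    """Enumerate conflict hyperedges for 1-D points (assumed definition).

    A hyperedge is a tuple of indices of >= 2 points that have pairwise
    distinct labels and whose closed intervals [x - eps, x + eps] all
    contain at least one common value.
    """
    n = len(points)
    edges = []
    for k in range(2, n + 1):
        for idx in combinations(range(n), k):
            # Points sharing a label never conflict with each other.
            if len({labels[i] for i in idx}) < k:
                continue
            # 1-D intervals share a point iff max(left) <= min(right).
            lo = max(points[i] - eps for i in idx)
            hi = min(points[i] + eps for i in idx)
            if lo <= hi:
                edges.append(idx)
    return edges


# Hypothetical toy dataset: three nearby points of different classes
# and one distant point.
points = [0.0, 1.0, 2.0, 10.0]
labels = ["a", "b", "c", "a"]
edges = conflict_hyperedges(points, labels, eps=1.0)
print(edges)
```

Each hyperedge marks a group of points that the attacker can move to one shared input. Because the points in a hyperedge have distinct labels, whichever single label a classifier outputs at that input can be correct for at most one of them.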