We propose a new framework for algorithmic stability in the context of multiclass classification. In practice, classification algorithms often operate by first assigning a continuous score (for instance, an estimated probability) to each possible label, then taking the maximizer -- i.e., selecting the class that has the highest score. A drawback of this type of approach is that it is inherently unstable, meaning that it is very sensitive to slight perturbations of the training data, since taking the maximizer is discontinuous. Motivated by this challenge, we propose a pipeline for constructing stable classifiers from data, using bagging (i.e., resampling and averaging) to produce stable continuous scores, and then using a stable relaxation of argmax, which we call the "inflated argmax," to convert these scores to a set of candidate labels. The resulting stability guarantee places no distributional assumptions on the data, does not depend on the number of classes or dimensionality of the covariates, and holds for any base classifier. Using a common benchmark data set, we demonstrate that the inflated argmax provides necessary protection against unstable classifiers, without loss of accuracy.
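The two-stage pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `fit_score` is a hypothetical user-supplied function that trains a base classifier on a (resampled) dataset and returns class scores for a test point, and the `inflated_argmax` shown here is a simplified thresholded relaxation (return every label whose score is within `eps` of the maximum) standing in for the paper's exact Euclidean-distance construction.

```python
import numpy as np

def bagged_scores(fit_score, data, x, n_bags=100, rng=None):
    """Bagging step: average class scores over bootstrap resamples of the
    training data, producing a smoother, more stable score vector for x.
    `fit_score(resampled_data, x)` is a hypothetical user-supplied routine
    that trains a base classifier and returns a vector of class scores."""
    rng = np.random.default_rng(rng)
    n = len(data)
    scores = [
        fit_score([data[i] for i in rng.integers(0, n, size=n)], x)
        for _ in range(n_bags)
    ]
    return np.mean(scores, axis=0)

def inflated_argmax(scores, eps):
    """Simplified stand-in for the inflated argmax: return the set of all
    labels whose score is within eps of the top score. Like the paper's
    construction, it relaxes the discontinuous argmax into a set-valued,
    stable selection rule (the exact definition in the paper differs)."""
    scores = np.asarray(scores, dtype=float)
    return {j for j, s in enumerate(scores) if s >= scores.max() - eps}
```

For example, `inflated_argmax([0.50, 0.49, 0.10], 0.05)` returns both near-maximal labels `{0, 1}`, whereas a plain argmax would unstably pick one of them; with a clear winner, such as `[0.9, 0.05, 0.05]`, the set collapses to the single label `{0}`.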