In contemporary deep learning practice, models are often trained to near-zero loss, i.e., to nearly interpolate the training data. Typically, however, the number of parameters in such models far exceeds the number of data points $n$, which is the theoretical minimum needed for interpolation: a phenomenon referred to as overparameterization. In an interesting piece of work contributing to the considerable research devoted to understanding overparameterization, Bubeck and Sellke showed that for a broad class of covariate distributions (specifically, those satisfying a natural notion of concentration of measure), overparameterization is necessary for robust interpolation, i.e., when the interpolating function is required to be Lipschitz. Their robustness result, however, was proved only in the setting of regression with square loss, whereas in practice many other losses are used, e.g., cross-entropy loss for classification. In this work, we generalize Bubeck and Sellke's result to Bregman divergence losses, which form a common generalization of square loss and cross-entropy loss. Our generalization relies on identifying a bias-variance-type decomposition that lies at the heart of the proof of Bubeck and Sellke.
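As a concrete illustration of how Bregman divergences unify the two losses mentioned above, the following sketch (our own, not from the paper) computes $D_\varphi(x, y) = \varphi(x) - \varphi(y) - \langle \nabla\varphi(y), x - y\rangle$ and checks that the choice $\varphi(u) = \|u\|^2$ recovers the square loss, while $\varphi(u) = \sum_i u_i \log u_i$ (negative entropy) recovers the KL divergence, which equals the cross-entropy loss up to an additive constant independent of the prediction:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# phi(u) = ||u||^2 recovers the square loss ||x - y||^2:
sq = lambda u: np.dot(u, u)
grad_sq = lambda u: 2.0 * u
x, y = np.array([1.0, 2.0]), np.array([0.5, 1.0])
assert np.isclose(bregman(sq, grad_sq, x, y), np.sum((x - y) ** 2))

# phi(u) = sum_i u_i log u_i (negative entropy) gives, on probability
# vectors, the KL divergence KL(p || q) = sum_i p_i log(p_i / q_i):
negent = lambda u: np.sum(u * np.log(u))
grad_negent = lambda u: np.log(u) + 1.0
p, q = np.array([0.3, 0.7]), np.array([0.6, 0.4])
assert np.isclose(bregman(negent, grad_negent, p, q), np.sum(p * np.log(p / q)))
```

The function names `bregman`, `negent`, etc. are illustrative; any strictly convex, differentiable $\varphi$ induces a valid Bregman divergence in the same way.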