In contemporary deep learning practice, models are often trained to near-zero loss, i.e., to nearly interpolate the training data. However, the number of parameters in the model is usually far larger than the number of data points $n$, the theoretical minimum needed for interpolation: a phenomenon referred to as overparameterization. In an interesting piece of work that contributes to the considerable research devoted to understanding overparameterization, Bubeck and Sellke showed that for a broad class of covariate distributions (specifically, those satisfying a natural notion of concentration of measure), overparameterization is necessary for robust interpolation, i.e., if the interpolating function is required to be Lipschitz. However, their robustness results were proved only in the setting of regression with square loss. In practice, however, many other kinds of losses are used, e.g., cross-entropy loss for classification. In this work, we generalize Bubeck and Sellke's result to Bregman divergence losses, which form a common generalization of square loss and cross-entropy loss. Our generalization relies on identifying a bias-variance type decomposition that lies at the heart of the proof of Bubeck and Sellke.
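For concreteness, recall the standard definition of a Bregman divergence (the notation below is illustrative and need not match that of the paper): for a strictly convex potential $\phi$,
\[
  D_\phi(p, q) \;=\; \phi(p) - \phi(q) - \langle \nabla \phi(q),\, p - q \rangle .
\]
Taking $\phi(x) = \|x\|_2^2$ recovers the square loss $D_\phi(p, q) = \|p - q\|_2^2$, while the negative entropy $\phi(x) = \sum_i x_i \log x_i$ on the probability simplex yields the KL divergence $D_\phi(p, q) = \sum_i p_i \log(p_i / q_i)$, which equals the cross-entropy loss up to an additive term that does not depend on the prediction $q$.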