Despite being highly over-parametrized and able to fully interpolate the training data, deep networks are known to generalize well to unseen data. It is now understood that part of the reason is that the training algorithms used have implicit regularization properties that ensure interpolating solutions with "good" properties are found. This is best understood in over-parametrized linear models, where it has been shown that the celebrated stochastic gradient descent (SGD) algorithm finds the interpolating solution that is closest in Euclidean distance to the initial weight vector. Replacing SGD with stochastic mirror descent (SMD) yields different regularizers, with the Euclidean distance replaced by a Bregman divergence. Empirical observations have shown that, in the deep network setting, SMD achieves a generalization performance that differs from that of SGD (and that depends on the choice of SMD's potential function). In an attempt to begin to understand this behavior, we derive the generalization error of SMD for over-parametrized linear models in a binary classification problem where the two classes are drawn from a Gaussian mixture model. We present simulation results that validate the theory and, in particular, introduce two data models: one for which SMD with an $\ell_2$ regularizer (i.e., SGD) outperforms SMD with an $\ell_1$ regularizer, and one for which the reverse happens.
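The implicit-regularization claim for SGD in the linear setting can be checked numerically. The following is a minimal sketch, not the paper's experiment: it assumes a squared loss, a small constant step size, and a tiny made-up data set with 2 samples and 4 weights. The function names `sgd_interpolate` and `closest_interpolator` are hypothetical helpers introduced only for this illustration.

```python
# Sketch: SGD on an over-parametrized linear model (n = 2 samples, d = 4 weights).
# Assumptions: squared loss, constant step size, made-up data values.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgd_interpolate(X, y, w0, lr=0.1, epochs=2000):
    """Cycle through the samples, taking one squared-loss SGD step per sample."""
    w = list(w0)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            err = y_i - dot(x_i, w)  # residual on this sample
            w = [w_j + lr * err * x_j for w_j, x_j in zip(w, x_i)]
    return w

def closest_interpolator(X, y, w0):
    """Closed form of argmin ||w - w0||_2 subject to Xw = y, for exactly two samples."""
    r = [y_i - dot(x_i, w0) for x_i, y_i in zip(X, y)]  # residuals at w0
    # Entries of the 2x2 Gram matrix X X^T
    a, b, c = dot(X[0], X[0]), dot(X[0], X[1]), dot(X[1], X[1])
    det = a * c - b * b
    # Solve (X X^T) alpha = r with the explicit 2x2 inverse
    alpha0 = (c * r[0] - b * r[1]) / det
    alpha1 = (a * r[1] - b * r[0]) / det
    # w* = w0 + X^T alpha
    return [w_j + alpha0 * p + alpha1 * q for w_j, p, q in zip(w0, X[0], X[1])]

X = [[1.0, 2.0, 0.0, -1.0], [0.0, 1.0, 1.0, 1.0]]  # illustrative features
y = [1.0, -1.0]                                     # illustrative targets
w0 = [0.5, -0.5, 1.0, 0.0]                          # illustrative initialization

w_sgd = sgd_interpolate(X, y, w0)
w_star = closest_interpolator(X, y, w0)
print("largest coordinate gap:", max(abs(p - q) for p, q in zip(w_sgd, w_star)))
```

Every SGD step moves the weights along one of the data vectors, so the iterates never leave the affine set $w_0 + \mathrm{span}\{x_1, x_2\}$. The only interpolating point in that set is the Euclidean projection of $w_0$ onto the solutions of $Xw = y$, which is why the two vectors should agree up to numerical error.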