Learning algorithms that divide the data into batches are prevalent in many machine-learning applications, typically offering useful trade-offs between computational efficiency and performance. In this paper, we examine the benefits of batch partitioning through the lens of a minimum-norm overparametrized linear regression model with isotropic Gaussian features. We propose a natural small-batch version of the minimum-norm estimator and derive bounds on its quadratic risk. We then characterize the optimal batch size and show that it is inversely proportional to the noise level, as well as to the overparametrization ratio. In contrast to the minimum-norm estimator, our estimator admits a stable risk behavior that is monotonically increasing in the overparametrization ratio, eliminating both the blowup at the interpolation point and the double-descent phenomenon. We further show that shrinking the batch minimum-norm estimator by a factor equal to the Wiener coefficient further stabilizes it and results in a lower quadratic risk in all settings. Interestingly, we observe that the implicit regularization offered by the batch partition is partially explained by feature overlap between the batches. Our bound is derived via a novel combination of techniques, in particular a normal approximation, in the Wasserstein metric, of noisy projections onto random subspaces.
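To make the construction concrete, below is a minimal NumPy sketch of one plausible batch minimum-norm estimator: the $n$ samples are split into disjoint batches $(X_k, y_k)$, each batch's minimum-norm least-squares solution $X_k^{+} y_k$ is computed via the pseudoinverse, and the per-batch solutions are averaged, $\hat{\beta}_{\mathrm{batch}} = \frac{1}{K}\sum_{k=1}^{K} X_k^{+} y_k$, with an optional Wiener-style shrinkage factor applied at the end. The function name `batch_min_norm`, the averaging aggregation rule, and the toy parameters are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def batch_min_norm(X, y, batch_size, shrink=None):
    """Sketch of a batch minimum-norm estimator (assumed aggregation:
    average the per-batch minimum-norm least-squares solutions).
    With batch_size < p, each batch system is underdetermined and
    np.linalg.pinv returns its minimum-norm solution."""
    n, p = X.shape
    estimates = []
    for start in range(0, n, batch_size):
        Xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        estimates.append(np.linalg.pinv(Xb) @ yb)
    beta_hat = np.mean(estimates, axis=0)
    if shrink is not None:
        beta_hat = shrink * beta_hat  # e.g. a Wiener-type shrinkage factor
    return beta_hat

# Toy usage: overparametrized isotropic Gaussian design (p > n).
rng = np.random.default_rng(0)
n, p, sigma = 200, 400, 0.5
beta = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))               # isotropic Gaussian features
y = X @ beta + sigma * rng.normal(size=n)
beta_hat = batch_min_norm(X, y, batch_size=20)
print(np.sum((beta_hat - beta) ** 2))     # quadratic-risk proxy on one draw
```

Varying `batch_size` in this sketch lets one probe the trade-off the abstract describes: very small batches under-use the data, while batches approaching the interpolation point recover the unstable behavior of the full minimum-norm estimator.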