It is well known that Empirical Risk Minimization (ERM) with squared loss may attain minimax-suboptimal error rates (Birgé and Massart, 1993). The key message of this paper is that, under mild assumptions, the suboptimality of ERM must be due to large bias rather than variance. More precisely, in the bias-variance decomposition of the squared error of the ERM, the variance term necessarily enjoys the minimax rate. In the case of fixed design, we provide an elementary proof of this fact using the probabilistic method. Then, we prove this result for various models in the random design setting. In addition, we provide a simple proof of Chatterjee's admissibility theorem (Chatterjee, 2014, Theorem 1.4), which states that, in the fixed design setting, ERM cannot be ruled out as an optimal method, and we extend this result to the random design setting. We also show that our estimates imply stability of ERM, complementing the main result of Caponnetto and Rakhlin (2006) for non-Donsker classes. Finally, we show that for non-Donsker classes, there are functions close to the ERM, yet far from being almost-minimizers of the empirical loss, highlighting the somewhat irregular nature of the loss landscape.
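
For concreteness, the decomposition referred to above can be sketched as follows. The notation is an assumption made here and does not appear in the abstract: $\hat f_n$ denotes the ERM, $f^*$ the true regression function, and $\|\cdot\|$ the relevant $L_2$ norm (empirical in fixed design, population in random design).

\[
\mathbb{E}\,\|\hat f_n - f^*\|^2
= \underbrace{\|\mathbb{E}\hat f_n - f^*\|^2}_{\text{squared bias}}
+ \underbrace{\mathbb{E}\,\|\hat f_n - \mathbb{E}\hat f_n\|^2}_{\text{variance}}
\]

The cross term vanishes because $\mathbb{E}[\hat f_n - \mathbb{E}\hat f_n] = 0$. In this notation, the claim is that the variance term is at most of the order of the minimax rate, so any suboptimality of ERM must come from the squared bias term.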