We study the sample complexity of the best-case Empirical Risk Minimizer (ERM) in the setting of stochastic convex optimization. We show that there exists an instance in which the sample size is linear in the dimension, learning is possible, and yet the Empirical Risk Minimizer is likely to be unique and to overfit. This resolves an open question posed by Feldman. We also extend this result to approximate ERMs. Building on our construction, we further show that (constrained) Gradient Descent can overfit when the horizon and learning rate grow with the sample size. Specifically, we provide a novel generalization lower bound of $\Omega\left(\sqrt{\eta T/m^{1.5}}\right)$ for Gradient Descent, where $\eta$ is the learning rate, $T$ is the horizon, and $m$ is the sample size. This exponentially narrows the gap between the best known upper bound of $O(\eta T/m)$ and the lower bounds obtained from previous constructions.
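For readability, the lower bound can be restated by a straightforward algebraic rewriting (no additional assumptions beyond the statement above), which makes the polynomial gap to the upper bound explicit:
\[
\Omega\!\left(\sqrt{\frac{\eta T}{m^{1.5}}}\right)
= \Omega\!\left(\frac{\sqrt{\eta T}}{m^{3/4}}\right),
\qquad\text{to be compared with the upper bound } O\!\left(\frac{\eta T}{m}\right).
\]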