This work studies the generalization error of gradient methods. More specifically, we focus on how the number of training steps $T$ and the step size $\eta$ affect generalization in smooth stochastic convex optimization (SCO) problems. We first provide tight excess risk lower bounds for Gradient Descent (GD) and Stochastic Gradient Descent (SGD) in the general non-realizable smooth SCO setting, suggesting that existing stability analyses are tight in their step-size and iteration dependence, and that overfitting provably happens. Next, we study the realizable case, i.e., when a single optimal solution minimizes the loss on every data point. Recent works show that better rates can be attained in this setting, but that the improvement diminishes when training time is long. Our paper examines this observation by providing excess risk lower bounds for GD and SGD in two realizable settings: (1) $\eta T = \bigO{n}$, and (2) $\eta T = \bigOmega{n}$, where $n$ is the size of the dataset. In the first case, $\eta T = \bigO{n}$, our lower bounds tightly match and certify the respective upper bounds. However, for the case $\eta T = \bigOmega{n}$, our analysis indicates a gap between the lower and upper bounds. We conjecture that the gap can be closed by improving the upper bounds, a conjecture supported by analyses in two special scenarios.