The learning theory community has recently made progress in characterizing the generalization error of gradient methods for general convex losses. In this work, we focus on how training longer might affect generalization in smooth stochastic convex optimization (SCO) problems. We first provide tight lower bounds for general non-realizable SCO problems. Furthermore, existing upper bounds suggest that sample complexity can be improved by assuming the loss is realizable, i.e., that an optimal solution simultaneously minimizes the loss on all data points. However, this improvement is compromised when the training time is long, and matching lower bounds are lacking. Our paper examines this observation by providing excess risk lower bounds for gradient descent (GD) and stochastic gradient descent (SGD) in two realizable settings: (1) realizable with $T = O(n)$, and (2) realizable with $T = \Omega(n)$, where $T$ denotes the number of training iterations and $n$ is the size of the training dataset. These bounds are novel and informative in characterizing the relationship between $T$ and $n$. In the first case, with a short training horizon, our lower bounds almost tightly match the corresponding upper bounds and provide the first certificates of their optimality. However, for the realizable case with $T = \Omega(n)$, a gap exists between the lower and upper bounds. We conjecture that this gap can be closed by improving the upper bounds, a conjecture supported by our analyses in one-dimensional and linear regression scenarios.
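To make the quantities in the abstract concrete, the following is a minimal sketch of full-batch GD on a realizable linear regression problem, reporting empirical and excess risk for several training horizons $T$. Everything in it is an illustrative assumption rather than the paper's construction: the squared loss, the Gaussian inputs, the noiseless labels, the step size `lr`, and all function names are hypothetical choices. It does not reproduce any of the lower or upper bounds discussed above.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def make_realizable_data(n, d, w_star, rng):
    """Noiseless labels y = <w_star, x>, so w_star has zero loss on every sample."""
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    ys = [dot(w_star, x) for x in xs]
    return xs, ys


def empirical_risk(w, xs, ys):
    # Average squared loss 0.5 * (<w, x> - y)^2 over the training set.
    return sum(0.5 * (dot(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def excess_risk(w, w_star):
    # Assumption: x ~ N(0, I) and noiseless labels, so the population risk is
    # 0.5 * ||w - w_star||^2 and its minimum value is 0.
    return 0.5 * sum((a - b) ** 2 for a, b in zip(w, w_star))


def gradient_descent(xs, ys, steps, lr):
    """Full-batch GD on the empirical risk, started from the origin."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            residual = dot(w, x) - y
            for j in range(d):
                grad[j] += residual * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def run_demo(n=20, d=40, horizons=(0, 10, 100, 1000), lr=0.05, seed=0):
    """Return (T, empirical risk, excess risk) for each training horizon T."""
    rng = random.Random(seed)
    w_star = [rng.gauss(0.0, 1.0) for _ in range(d)]
    xs, ys = make_realizable_data(n, d, w_star, rng)
    rows = []
    for T in horizons:
        w = gradient_descent(xs, ys, T, lr)
        rows.append((T, empirical_risk(w, xs, ys), excess_risk(w, w_star)))
    return rows


if __name__ == "__main__":
    for T, train, excess in run_demo():
        print(f"T={T:5d}  empirical risk={train:.3e}  excess risk={excess:.3e}")
```

The dimension `d` is deliberately larger than `n` here so that the empirical minimizer is not unique and the empirical and excess risks can differ. Varying `n` and the horizons in `run_demo` gives a rough feel for the $T = O(n)$ and $T = \Omega(n)$ regimes, though a single benign instance like this says nothing about worst-case behavior.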