It is folklore that reusing training data more than once can improve the statistical efficiency of gradient-based learning. While this phenomenon has been extensively studied in linear regression, the benefit of multi-pass gradient descent (GD, which reuses all the data) over one-pass stochastic gradient descent (online SGD, which uses each data point only once) is not well understood in nonlinear and non-convex settings, except for a loss modification mechanism achieved by the first two passes over the data. In this work, we consider learning a $d$-dimensional single-index model with a quadratic activation, for which it is known that one-pass SGD requires $n\gtrsim d\log d$ samples to achieve weak recovery. We first show that this $\log d$ factor in the sample complexity persists for full-batch spherical GD on the correlation loss; however, by simply truncating the activation, full-batch GD exhibits a favorable optimization landscape at $n \simeq d$ samples, thereby outperforming one-pass SGD (with the same activation) in statistical efficiency. We complement this result with a trajectory analysis of full-batch GD on the squared loss from small initialization, showing that $n \gtrsim d$ samples and $T \gtrsim\log d$ gradient steps suffice to achieve strong (exact) recovery.
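To make the setting concrete, the following is a minimal NumPy sketch of full-batch spherical GD on the correlation loss $L(\theta) = -\frac{1}{n}\sum_i y_i\,\sigma(\langle x_i,\theta\rangle)$ for a single-index model with a quadratic link. Several choices are assumptions made here for illustration and are not taken from the abstract: Gaussian inputs, the centered link $y = \langle x,\theta^\star\rangle^2 - 1$, the step size, and in particular the clipping rule $\sigma_\tau(z) = \min(z^2,\tau) - 1$ used as a stand-in for "truncating the activation". The helper names `make_data` and `spherical_gd` are hypothetical. The sketch runs in an easy regime ($n \gg d$) and does not attempt to reproduce the $n \simeq d$ versus $n \gtrsim d\log d$ separation.

```python
import numpy as np


def make_data(n, d, rng):
    """Sample Gaussian inputs and labels from a quadratic single-index model.

    Assumption: the link is the centered quadratic y = <x, theta*>^2 - 1.
    """
    theta_star = rng.standard_normal(d)
    theta_star /= np.linalg.norm(theta_star)
    X = rng.standard_normal((n, d))
    y = (X @ theta_star) ** 2 - 1.0
    return X, y, theta_star


def spherical_gd(X, y, theta0, lr=0.1, steps=300, tau=None):
    """Full-batch spherical GD on the correlation loss -(1/n) sum_i y_i * sigma(<x_i, theta>).

    With tau=None, sigma(z) = z^2 - 1. With a finite tau, sigma is replaced by
    the clipped activation min(z^2, tau) - 1 (an illustrative truncation,
    not necessarily the one analyzed in the paper).
    """
    n = X.shape[0]
    theta = theta0 / np.linalg.norm(theta0)
    for _ in range(steps):
        z = X @ theta
        dsigma = 2.0 * z  # derivative of z^2 - 1
        if tau is not None:
            # The clipped activation is flat wherever z^2 >= tau.
            dsigma = np.where(z ** 2 < tau, dsigma, 0.0)
        grad = -(X.T @ (y * dsigma)) / n   # Euclidean gradient over the full batch
        grad -= (grad @ theta) * theta     # project onto the tangent space of the sphere
        theta = theta - lr * grad          # gradient step
        theta /= np.linalg.norm(theta)     # retract back to the unit sphere
    return theta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y, theta_star = make_data(n=4000, d=20, rng=rng)
    theta0 = rng.standard_normal(20)
    for tau in (None, 4.0):
        theta = spherical_gd(X, y, theta0, tau=tau)
        # The sign of theta is not identifiable under an even link, so report |overlap|.
        print("tau =", tau, "overlap =", abs(theta @ theta_star))
```

Because the link is even, $\theta^\star$ can only be recovered up to sign, which is why the sketch measures the absolute overlap $|\langle\theta,\theta^\star\rangle|$. Without truncation, the update reduces to a shifted power iteration on the matrix $\frac{1}{n}\sum_i y_i x_i x_i^\top$.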