It is folklore that reusing training data more than once can improve the statistical efficiency of gradient-based learning. However, beyond linear regression, the theoretical advantage of full-batch gradient descent (GD, which always reuses all the data) over one-pass stochastic gradient descent (online SGD, which uses each data point only once) remains unclear. In this work, we consider learning a $d$-dimensional single-index model with a quadratic activation, for which it is known that one-pass SGD requires $n\gtrsim d\log d$ samples to achieve weak recovery. We first show that this $\log d$ factor in the sample complexity persists for full-batch spherical GD on the correlation loss; however, by simply truncating the activation, full-batch GD exhibits a favorable optimization landscape at $n \simeq d$ samples, thereby outperforming one-pass SGD (with the same activation) in statistical efficiency. We complement this result with a trajectory analysis of full-batch GD on the squared loss from small initialization, showing that $n \gtrsim d$ samples and $T \gtrsim\log d$ gradient steps suffice to achieve strong (exact) recovery.
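As an illustration of the setting in the last sentence of the abstract, the following is a minimal toy sketch (not the paper's algorithm or proof regime): a noiseless single-index model $y = (\langle w_\star, x\rangle)^2$ with Gaussian inputs, trained by full-batch gradient descent on the squared loss from a small random initialization. The dimensions, step size, and iteration count below are illustrative choices, not the paper's $n \simeq d$, $T \gtrsim \log d$ scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 2000  # toy sizes; the paper's regime of interest is n ~ d, not n >> d

# Planted direction and data: single-index model with quadratic activation
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((n, d))       # rows x_i ~ N(0, I_d)
y = (X @ w_star) ** 2                 # noiseless labels y_i = <w_star, x_i>^2

def overlap(w):
    """|<w, w_star>| / ||w||: the recovery statistic (sign-invariant,
    since the quadratic activation cannot distinguish w_star from -w_star)."""
    return abs(w @ w_star) / np.linalg.norm(w)

# Full-batch GD on the empirical squared loss, from small initialization.
# Early on the update acts like power iteration on (1/n) sum_i y_i x_i x_i^T,
# whose top eigenvector correlates with w_star, so the overlap grows first
# and the norm catches up later.
w = 1e-3 * rng.standard_normal(d)
lr = 0.02
for _ in range(500):
    r = (X @ w) ** 2 - y                    # residuals
    grad = (4.0 / n) * (X.T @ (r * (X @ w)))  # gradient of mean squared loss
    w -= lr * grad

print(f"overlap with w_star after full-batch GD: {overlap(w):.3f}")
```

In this noiseless toy instance the global minimizers are exactly $\pm w_\star$, so with enough samples and steps the overlap approaches 1, i.e. strong (exact) recovery up to sign.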