Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson & Zhang (2013), is a theoretically compelling optimization method. However, as Defazio & Bottou (2019) highlight, its effectiveness in deep learning is yet to be proven. In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks. Our analysis finds that, for deeper networks, the strength of the variance reduction term in SVRG should be smaller and should decrease as training progresses. Inspired by this, we introduce a multiplicative coefficient $\alpha$ to control this strength and adjust it through a linear decay schedule. We name our method $\alpha$-SVRG. Our results show that $\alpha$-SVRG better optimizes neural networks, consistently reducing training loss compared to both the baseline and standard SVRG across various architectures and image classification datasets. We hope our findings encourage further exploration into variance reduction techniques in deep learning. Code is available at https://github.com/davidyyd/alpha-SVRG.
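To make the role of the coefficient concrete, the following is a minimal sketch of how a scaled variance reduction term and a linear decay schedule could be combined. It is an illustration under stated assumptions, not the released implementation: the function names `linear_decay_alpha` and `alpha_svrg_gradient` are hypothetical, the update form $g = \nabla f_i(\theta) - \alpha\,(\nabla f_i(\tilde{\theta}) - \nabla F(\tilde{\theta}))$ is the standard SVRG estimator with the correction term multiplied by $\alpha$, and decaying $\alpha$ all the way to zero is an assumed endpoint.

```python
from typing import List


def linear_decay_alpha(alpha0: float, step: int, total_steps: int) -> float:
    """Linearly decay the coefficient from alpha0 at step 0 to 0 at the last step.

    Assumption: the schedule ends at zero; the abstract only states that the
    decay is linear.
    """
    if total_steps <= 1:
        return alpha0
    # Clamp the step into the valid range before computing the fraction done.
    clamped = min(max(step, 0), total_steps - 1)
    frac = clamped / (total_steps - 1)
    return alpha0 * (1.0 - frac)


def alpha_svrg_gradient(
    grad_i: List[float],
    snap_grad_i: List[float],
    snap_full_grad: List[float],
    alpha: float,
) -> List[float]:
    """Return the mini-batch gradient with a scaled variance reduction term.

    grad_i         : mini-batch gradient at the current weights
    snap_grad_i    : same mini-batch's gradient at the snapshot weights
    snap_full_grad : full-dataset gradient at the snapshot weights
    alpha          : strength of the correction; alpha = 1 gives standard
                     SVRG, alpha = 0 gives the plain stochastic gradient
    """
    return [
        g - alpha * (s - m)
        for g, s, m in zip(grad_i, snap_grad_i, snap_full_grad)
    ]


if __name__ == "__main__":
    # Toy values chosen only for illustration.
    total_steps = 11
    for step in (0, 5, 10):
        alpha = linear_decay_alpha(0.5, step, total_steps)
        g = alpha_svrg_gradient([1.0, 2.0], [0.5, 1.0], [0.25, 0.5], alpha)
        print(step, alpha, g)
```

In this sketch, setting `alpha` to 1 recovers the standard SVRG estimator and setting it to 0 recovers the plain stochastic gradient, so the schedule interpolates from partial variance reduction early in training toward ordinary SGD-style updates at the end.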