Despite the exceptional achievements of deep learning, training neural networks remains computationally expensive and is often plagued by instabilities that degrade convergence. Learning rate schedules can help mitigate these issues, but finding a good schedule is time-consuming and resource-intensive. This work explores theoretical issues concerning training stability in the constant-learning-rate (i.e., without schedule) and small-batch-size regime. Surprisingly, we show that the order in which gradient updates are composed affects stability and convergence in gradient-based optimizers. We illustrate this new line of thinking using backward-SGD, which produces the parameter iterate at each step by reversing the usual forward composition order of the batch gradient updates. Our theoretical analysis shows that in contractive regions (e.g., around minima) backward-SGD converges to a point, whereas standard forward-SGD generally converges only to a distribution. This leads to improved stability and convergence, which we demonstrate experimentally. While full backward-SGD is computationally intensive in practice, it highlights that the extra freedom of modifying the usual iteration composition, by creatively reusing previous batches at each optimization step, may have important beneficial effects on training. Our experiments provide a proof of concept supporting this phenomenon. To our knowledge, this represents a new and unexplored avenue in deep learning optimization.
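To make the composition-order idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical one-dimensional toy problem in which batch i has loss 0.5 * (theta - c_i)^2, so that one constant-learning-rate step is the contraction T_i(theta) = theta - lr * (theta - c_i). Forward-SGD computes T_n(...T_1(theta0)), while the backward variant computes T_1(...T_n(theta0)). All names (`sgd_step`, `forward_sgd`, `backward_sgd`, `tail_jump`, `run`) and the toy loss are illustrative assumptions.

```python
import random

def sgd_step(theta, center, lr):
    """One SGD step on the toy batch loss 0.5 * (theta - center) ** 2."""
    return theta - lr * (theta - center)

def forward_sgd(centers, theta0, lr):
    """Usual order T_n(...T_2(T_1(theta0))): the newest batch is applied last."""
    theta = theta0
    for center in centers:
        theta = sgd_step(theta, center, lr)
    return theta

def backward_sgd(centers, theta0, lr):
    """Reversed order T_1(T_2(...T_n(theta0))): the newest batch is applied first."""
    theta = theta0
    for center in reversed(centers):
        theta = sgd_step(theta, center, lr)
    return theta

def tail_jump(path, window=20):
    """Largest change between consecutive iterates over the last `window` steps."""
    tail = path[-(window + 1):]
    return max(abs(b - a) for a, b in zip(tail, tail[1:]))

def run(centers, theta0=3.0, lr=0.5):
    """Recompute both iterates from scratch each time a new batch arrives."""
    forward_path, backward_path = [], []
    for n in range(1, len(centers) + 1):
        forward_path.append(forward_sgd(centers[:n], theta0, lr))
        backward_path.append(backward_sgd(centers[:n], theta0, lr))
    return forward_path, backward_path

# Hypothetical batch stream: each batch pulls theta toward -1 or +1 at random.
rng = random.Random(0)
centers = [rng.choice([-1.0, 1.0]) for _ in range(80)]
forward_path, backward_path = run(centers)
print(f"forward tail jump:  {tail_jump(forward_path):.3e}")
print(f"backward tail jump: {tail_jump(backward_path):.3e}")
```

In this toy setting, consecutive backward iterates differ by a factor that shrinks geometrically with the number of batches, so the sequence settles on a single point, whereas the forward iterate keeps being pushed around by the most recent batch. The sketch also shows the cost noted above: each backward iterate replays every batch seen so far, so computing n iterates takes on the order of n^2 update steps.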