A Unified Analysis for Finite Weight Averaging

Averaging the iterates of Stochastic Gradient Descent (SGD) has achieved empirical success in training deep learning models, through methods such as Stochastic Weight Averaging (SWA), Exponential Moving Average (EMA), and LAtest Weight Averaging (LAWA). In particular, LAWA, which averages a finite number of weights, can attain faster convergence and better generalization. However, its theoretical explanation remains underexplored, since there are fundamental differences between the finite and infinite settings. In this work, we first generalize SGD and LAWA as Finite Weight Averaging (FWA) and explain the advantages of FWA over SGD from the perspectives of optimization and generalization. A key challenge is that traditional methods, which work in the sense of expectation or optimal values for infinite-dimensional settings, do not apply when analyzing FWA's convergence. Second, the cumulative gradients introduced by FWA complicate the generalization analysis, and make it especially difficult to treat under different assumptions. Extending the final-iterate convergence analysis to FWA, this paper establishes, under a convexity assumption, a convergence bound of $\mathcal{O}(\log\left(\frac{T}{k}\right)/\sqrt{T})$, where $k\in[1, T/2]$ is a constant denoting the number of final iterations that are averaged. Compared with the $\mathcal{O}(\log(T)/\sqrt{T})$ bound for SGD, this proves theoretically that FWA has a faster convergence rate and explains the effect of the number of averaged points. In the generalization analysis, we use mathematical induction to find a recursive representation that bounds the cumulative gradient. We provide bounds for constant and decaying learning rates, in both the convex and non-convex cases, to show the good generalization performance of FWA. Finally, experimental results on several benchmarks verify our theoretical results.
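To make the averaging scheme concrete, the following is a minimal sketch of FWA, assuming it returns the uniform average of the last $k$ SGD iterates (so $k=1$ recovers plain SGD). The function names `sgd_with_fwa` and `noisy_quadratic_grad`, the toy objective $\frac{1}{2}\|w\|^2$ with Gaussian gradient noise, and the step size schedule are illustrative assumptions, not taken from the paper.

```python
from collections import deque

import numpy as np


def sgd_with_fwa(grad_fn, w0, n_steps, k, lr):
    """Run SGD; return (last iterate, uniform average of the last k iterates).

    Assumption: FWA is taken here to be the plain mean of the final k iterates.
    grad_fn(w, t) returns a (possibly stochastic) gradient; lr(t) is the step size.
    """
    w = np.asarray(w0, dtype=float).copy()
    window = deque(maxlen=k)  # keeps only the k most recent iterates
    for t in range(n_steps):
        w = w - lr(t) * grad_fn(w, t)
        window.append(w.copy())
    fwa = np.stack(list(window)).mean(axis=0)
    return w, fwa


def noisy_quadratic_grad(rng, noise=0.1):
    """Hypothetical toy problem: gradient of 0.5 * ||w||^2 plus Gaussian noise."""
    def grad(w, t):
        return w + noise * rng.standard_normal(w.shape)
    return grad


# Example run: compare the last iterate with the average of the last k = 100.
rng = np.random.default_rng(0)
last, avg = sgd_with_fwa(
    noisy_quadratic_grad(rng),
    w0=np.ones(5),
    n_steps=1000,
    k=100,
    lr=lambda t: 0.1 / np.sqrt(t + 1),  # decaying step size (an assumption)
)
print("distance to optimum, last iterate:", np.linalg.norm(last))
print("distance to optimum, FWA average: ", np.linalg.norm(avg))
```

Under the stated bound, choosing $k = T/2$ gives $\log(T/k) = \log 2$, so the rate becomes $\mathcal{O}(1/\sqrt{T})$, while $k = 1$ gives back the $\mathcal{O}(\log(T)/\sqrt{T})$ rate of SGD.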
