We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method that unifies and generalizes recent averaging-based optimizers, such as single-worker DiLoCo and Schedule-Free, within a non-distributed setting. While DiLoCo relies on a memory-intensive two-loop structure that periodically aggregates pseudo-gradients using Nesterov momentum, GPA eliminates this complexity by decoupling Nesterov's interpolation constants, enabling smooth iterate averaging at every step. Structurally, GPA resembles Schedule-Free but replaces uniform averaging with exponential moving averaging. Empirically, GPA consistently outperforms single-worker DiLoCo and AdamW while incurring less memory overhead. In terms of steps to reach a target validation loss, GPA achieves speedups of 8.71%, 10.13%, and 9.58% over the AdamW baseline for Llama-160M, 1B, and 8B models, respectively. Similarly, on the ImageNet ViT workload, GPA achieves speedups of 7% and 25.5% in the small- and large-batch settings, respectively. Furthermore, we prove that for any base optimizer with $O(\sqrt{T})$ regret, where $T$ is the number of iterations, GPA matches or exceeds the base optimizer's original convergence guarantees, depending on the choice of interpolation constants.
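To make the structural comparison concrete, here is a minimal sketch of a GPA-style update loop. It assumes the Schedule-Free three-sequence structure (base iterate, interpolation point, averaged iterate) with the uniform average swapped for an EMA, as the abstract describes; the function name `gpa_step`, the SGD base step, and all constants (`lr`, `beta`, `ema`) are illustrative assumptions, not the paper's notation.

```python
def gpa_step(x, z, grad_fn, lr=0.1, beta=0.9, ema=0.05):
    """One GPA-style step (illustrative sketch, not the paper's algorithm).

    x: averaged iterate used for evaluation
    z: base-optimizer iterate
    grad_fn: gradient oracle
    """
    # Interpolation point between the base iterate and the average;
    # GPA decouples this constant from the averaging constant below.
    y = (1 - beta) * z + beta * x
    g = grad_fn(y)                  # gradient queried at the interpolated point
    z = z - lr * g                  # base optimizer update (plain SGD here)
    # Exponential moving average of iterates (Schedule-Free would use a
    # uniform average, i.e. a 1/(t+1) weight, in place of the fixed `ema`).
    x = (1 - ema) * x + ema * z
    return x, z


# Usage: minimize f(w) = w^2 from w = 1; the averaged iterate x drifts to 0.
x, z = 1.0, 1.0
for _ in range(500):
    x, z = gpa_step(x, z, lambda w: 2.0 * w)
```

Setting the averaging constant to a step-dependent 1/(t+1) recovers Schedule-Free-style uniform averaging, which is the sense in which the fixed-EMA variant generalizes it.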