In this paper, we provide a comprehensive theoretical analysis of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak Heavy-Ball and Nesterov) for tracking time-varying optima under strong convexity and smoothness. Our finite-time bounds reveal a sharp decomposition of tracking error into transient, noise-induced, and drift-induced components. This decomposition exposes a fundamental trade-off: while momentum is often used as a gradient-smoothing heuristic, under distribution shift it incurs an explicit drift-amplification penalty that diverges as the momentum parameter $\beta$ approaches 1, yielding systematic tracking lag. We complement these upper bounds with minimax lower bounds under gradient-variation constraints, proving this momentum-induced tracking penalty is not an analytical artifact but an information-theoretic barrier: in drift-dominated regimes, momentum is unavoidably worse because stale-gradient averaging forces systematic lag. Our results provide theoretical grounding for the empirical instability of momentum in dynamic settings and precisely delineate regime boundaries where vanilla SGD provably outperforms its accelerated counterparts.
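The drift-amplification phenomenon described above can be illustrated with a minimal simulation. The sketch below is our own toy setup, not an experiment from the paper: we track the minimizer of $f_t(x) = \tfrac{1}{2}(x - c_t)^2$ while the optimum $c_t$ drifts as a random-sign walk, comparing vanilla SGD against Polyak heavy-ball at the same step size. All parameter values (`eta`, `beta`, `delta`, horizon) are illustrative choices.

```python
import numpy as np

# Toy tracking problem: f_t(x) = 0.5 * (x - c_t)^2, with the optimum c_t
# drifting by a random-sign increment of size `delta` at every step.
rng = np.random.default_rng(0)
T, eta, beta, delta = 20_000, 0.1, 0.9, 0.01

c = 0.0                          # current (moving) optimum
x_sgd = x_hb = x_hb_prev = 0.0
err_sgd, err_hb = [], []
for t in range(T):
    # Vanilla SGD step with the exact gradient of f_t.
    x_sgd = x_sgd - eta * (x_sgd - c)

    # Polyak heavy-ball: same gradient plus a momentum term that keeps
    # averaging in directions pointed at *stale* optimum locations.
    x_hb, x_hb_prev = x_hb - eta * (x_hb - c) + beta * (x_hb - x_hb_prev), x_hb

    # Random-sign drift of the optimum.
    c += delta * rng.choice([-1.0, 1.0])
    err_sgd.append((x_sgd - c) ** 2)
    err_hb.append((x_hb - c) ** 2)

burn = T // 2                    # discard the transient component
mse_sgd = float(np.mean(err_sgd[burn:]))
mse_hb = float(np.mean(err_hb[burn:]))
print(f"steady-state tracking MSE  SGD: {mse_sgd:.2e}  heavy-ball: {mse_hb:.2e}")
```

Under this random-sign drift, the momentum iterate obeys an oscillatory second-order recursion whose stationary variance grows as $\beta \to 1$, so with a large momentum parameter the heavy-ball steady-state tracking error exceeds that of plain SGD, consistent with the drift-dominated regime discussed in the abstract.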