While momentum-based acceleration has been studied extensively in deterministic optimization problems, its behavior in nonstationary environments -- where the data distribution and optimal parameters drift over time -- remains underexplored. We analyze the tracking performance of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak heavy-ball and Nesterov) under uniform strong convexity and smoothness, across different stepsize regimes. We derive finite-time bounds, in expectation and with high probability, for the tracking error, establishing a sharp decomposition into three components: a transient initialization term, a noise-induced variance term, and a drift-induced tracking lag. Crucially, our analysis uncovers a fundamental trade-off: while momentum can suppress gradient noise, it incurs an explicit penalty on tracking capability. We show that momentum can substantially amplify drift-induced tracking error, with amplification that becomes unbounded as the momentum parameter approaches one, formalizing the intuition that relying on "stale" gradients hinders adaptation to rapid regime shifts. Complementing these upper bounds, we establish minimax lower bounds for dynamic regret under gradient-variation constraints. These lower bounds prove that the inertia-induced penalty is not an artifact of analysis but an information-theoretic barrier: in drift-dominated regimes, momentum creates an unavoidable "inertia window" that fundamentally degrades performance. Collectively, these results provide a definitive theoretical grounding for the empirical instability of momentum in dynamic environments and delineate the precise regime boundaries where SGD provably outperforms its accelerated counterparts.
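The drift-induced penalty described above can be illustrated with a minimal simulation, which is not part of the paper's analysis: tracking the minimizer of a drifting quadratic f_t(x) = (x - x*_t)^2 / 2, where the optimum x*_t flips sign every few steps (an abrupt regime shift). All parameter values below (stepsize, momentum, switch period) are illustrative choices, not the paper's settings.

```python
import numpy as np

def drifting_target(t, period=20, amplitude=1.0):
    # Piecewise-constant optimum that flips sign every `period` steps,
    # modeling abrupt regime shifts.
    return amplitude if (t // period) % 2 == 0 else -amplitude

def run(momentum, eta=0.5, T=400, sigma=0.0, seed=0):
    # Heavy-ball SGD on f_t(x) = 0.5 * (x - x*_t)^2; momentum=0.0 recovers plain SGD.
    rng = np.random.default_rng(seed)
    x, x_prev = 0.0, 0.0
    errs = []
    for t in range(T):
        target = drifting_target(t)
        grad = (x - target) + sigma * rng.standard_normal()  # stochastic gradient
        x_new = x - eta * grad + momentum * (x - x_prev)     # heavy-ball update
        x_prev, x = x, x_new
        errs.append(abs(x - target))
    return np.mean(errs[T // 2:])  # average tracking error after a burn-in phase

sgd_err = run(momentum=0.0)
hb_err = run(momentum=0.9)
print(f"SGD tracking error:        {sgd_err:.3f}")
print(f"Heavy-ball (beta=0.9) err: {hb_err:.3f}")
```

With high momentum, the iterate overshoots each regime shift and oscillates through the "inertia window", so its average tracking error exceeds plain SGD's, consistent with the drift-dominated regime in the abstract.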