On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization

While momentum-based acceleration has been studied extensively in deterministic optimization problems, its behavior in nonstationary environments -- where the data distribution and optimal parameters drift over time -- remains underexplored. We analyze the tracking performance of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak heavy-ball and Nesterov) under uniform strong convexity and smoothness in varying stepsize regimes. We derive finite-time bounds in expectation and with high probability for the tracking error, establishing a sharp decomposition into three components: a transient initialization term, a noise-induced variance term, and a drift-induced tracking lag. Crucially, our analysis uncovers a fundamental trade-off: while momentum can suppress gradient noise, it incurs an explicit penalty on the tracking capability. We show that momentum can substantially amplify drift-induced tracking error, with amplification that becomes unbounded as the momentum parameter approaches one, formalizing the intuition that using 'stale' gradients hinders adaptation to rapid regime shifts. Complementing these upper bounds, we establish minimax lower bounds for dynamic regret under gradient-variation constraints. These lower bounds prove that the inertia-induced penalty is not an artifact of analysis but an information-theoretic barrier: in drift-dominated regimes, momentum creates an unavoidable 'inertia window' that fundamentally degrades performance. Collectively, these results provide a definitive theoretical grounding for the empirical instability of momentum in dynamic environments and delineate the precise regime boundaries where SGD provably outperforms its accelerated counterparts.
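The 'stale gradient' intuition can be illustrated with a toy, noise-free sketch. It is not taken from the paper and does not reproduce its bounds. The sketch assumes a one-dimensional quadratic objective whose minimizer jumps once (an abrupt regime shift), and a normalized heavy-ball update in which the stepsize `alpha` is held fixed while the momentum parameter `beta` varies, so that `beta = 0` recovers plain gradient descent. The function name `cumulative_tracking_error`, the objective, and all parameter values are hypothetical choices made for illustration.

```python
def cumulative_tracking_error(beta, alpha=0.1, shift=1.0, steps=2000):
    """Sum of squared tracking errors after an abrupt shift of the minimizer.

    Assumed toy objective: f(x) = 0.5 * (x - theta)**2, where theta jumps
    from 0 to `shift` at time 0 while the iterate still sits at 0.

    Assumed update (normalized heavy-ball; beta = 0 is plain gradient descent):
        m <- beta * m + (1 - beta) * grad
        x <- x - alpha * m
    """
    theta = shift          # minimizer after the regime shift
    x, m = 0.0, 0.0        # iterate and momentum buffer before the shift
    total = 0.0
    for _ in range(steps):
        err = x - theta
        total += err * err
        grad = err         # exact gradient of the quadratic (no noise in this toy)
        m = beta * m + (1.0 - beta) * grad
        x -= alpha * m
    return total


betas = [0.0, 0.5, 0.9, 0.99]
costs = [cumulative_tracking_error(b) for b in betas]
for b, c in zip(betas, costs):
    print(f"beta={b:<4}  cumulative squared error={c:.3f}")
```

In this particular setting the accumulated error grows with `beta`: the momentum buffer keeps averaging gradients computed before the iterate has caught up, so recovery from the shift slows as `beta` approaches one. Gradient noise is deliberately left out, so the sketch shows only the drift side of the trade-off and none of the variance suppression that momentum can provide.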