Stopping Rules for Stochastic Gradient Descent via Anytime-Valid Confidence Sequences

Deciding when to stop stochastic gradient descent (SGD) online, using only the observed trajectory, is a challenging theoretical problem with significant practical consequences. Although SGD is routinely monitored as it runs, the classical theory provides guarantees only at pre-specified iteration horizons and offers no statistically valid way to decide, from the observed trajectory, whether further computation is justified. We address this longstanding gap by developing anytime-valid confidence sequences for stochastic gradient methods. These sequences remain valid under continuous monitoring and directly induce statistically valid, trajectory-dependent stopping rules: stop as soon as the current upper confidence bound on an appropriate performance measure falls below a user-specified tolerance. The confidence sequences are constructed from nonnegative supermartingales, are time-uniform, and depend only on quantities observable along the SGD trajectory, so no prior knowledge of the optimization horizon is required. In convex optimization, this yields anytime-valid certificates for the weighted suboptimality of projected SGD under general stepsize schedules, without assuming smoothness or strong convexity. In nonconvex optimization, it yields time-uniform certificates for weighted first-order stationarity under smoothness assumptions. We further characterize the stopping-time complexity of the resulting rules under standard stepsize schedules. To the best of our knowledge, this is the first framework that provides statistically valid, time-uniform stopping rules for SGD in both convex and nonconvex settings based solely on the observed trajectory.
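To make the stopping rule above concrete, the following is a minimal sketch and not the construction developed in this work. It assumes a bounded domain of diameter D and stochastic subgradients bounded by G, and it uses a textbook fixed-parameter Hoeffding-type supermartingale combined with Ville's inequality to obtain a time-uniform upper bound on the stepsize-weighted suboptimality of projected SGD. The toy objective f(x) = E|x - Z|, the function name `sgd_with_anytime_stopping`, and the parameters `lam` and `c` are all hypothetical choices made for illustration.

```python
import math
import random


def sgd_with_anytime_stopping(tol=0.2, delta=0.05, lam=0.5, c=0.5,
                              max_iters=100_000, seed=0):
    """Projected SGD on f(x) = E|x - Z|, Z ~ Uniform(-0.5, 0.5), over [-1, 1].

    Stops at the first iteration where a time-uniform upper confidence bound
    on the stepsize-weighted suboptimality drops below `tol`.
    """
    rng = random.Random(seed)
    G, D = 1.0, 2.0        # assumed subgradient bound and domain diameter
    x = 1.0                # initial iterate
    sum_eta = 0.0          # sum of stepsizes
    sum_eta_x = 0.0        # stepsize-weighted sum of iterates
    sum_eta2_g2 = 0.0      # sum of eta_s^2 * g_s^2 (observable)
    sum_c2 = 0.0           # sum of squared ranges of the noise increments
    bound = float("inf")

    for t in range(1, max_iters + 1):
        eta = c / math.sqrt(t)                 # deterministic stepsize schedule
        z = rng.uniform(-0.5, 0.5)
        g = float((x > z) - (x < z))           # stochastic subgradient of |x - z|

        sum_eta += eta
        sum_eta_x += eta * x
        sum_eta2_g2 += (eta * g) ** 2
        # Each noise increment eta * (grad f(x) - g) * (x - x_star)
        # lies in [-2*G*D*eta, 2*G*D*eta].
        sum_c2 += (2.0 * G * D * eta) ** 2

        # Ville's inequality applied to exp(lam * S_t - lam^2 * sum_c2 / 2):
        # with probability >= 1 - delta, for ALL t simultaneously,
        #   S_t <= log(1/delta) / lam + lam * sum_c2 / 2.
        noise = math.log(1.0 / delta) / lam + 0.5 * lam * sum_c2
        bound = (0.5 * D ** 2 + 0.5 * sum_eta2_g2 + noise) / sum_eta

        x = min(1.0, max(-1.0, x - eta * g))   # projected SGD step

        if bound <= tol:                       # trajectory-dependent stopping rule
            return {"stopped": True, "iterations": t, "bound": bound,
                    "x_avg": sum_eta_x / sum_eta}

    return {"stopped": False, "iterations": max_iters, "bound": bound,
            "x_avg": sum_eta_x / sum_eta}


if __name__ == "__main__":
    print(sgd_with_anytime_stopping())
```

Because the bound holds simultaneously for all iterations with probability at least 1 - delta, checking it after every step does not invalidate the guarantee, which is the property a fixed-horizon bound lacks. By convexity, the suboptimality of the weighted average iterate `x_avg` is also at most the reported bound on the same event. This sketch is conservative; a tuned or adaptive choice of `lam` would typically stop earlier.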
