This work investigates the performance of the final iterate produced by stochastic gradient descent (SGD) under temporally dependent data. We consider two complementary sources of dependence: $(i)$ martingale-type dependence in both the covariate and noise processes, which accommodates non-stationary and non-mixing time series data, and $(ii)$ dependence induced by sequential decision making. Our formulation parallels the classical notions of (local) stationarity and strong mixing, yet neither framework fully subsumes the other. Remarkably, SGD is shown to automatically accommodate both independent and dependent information under a broad class of stepsize schedules and exploration-rate schemes. Non-asymptotically, we show that SGD simultaneously achieves statistically optimal estimation error and regret, extending and improving upon existing results. In particular, our tail bounds remain sharp even for a potentially infinite horizon $T=+\infty$. Asymptotically, the SGD iterates converge to a Gaussian distribution with only an $O_{\PP}(1/\sqrt{t})$ remainder, demonstrating that the estimation-regret trade-off claimed in prior work can in fact be avoided. We further propose a new ``conic'' approximation of the decision region that allows the covariates to have unbounded support. For online sparse regression, we develop a new SGD-based algorithm that uses only $d$ units of storage and requires $O(d)$ flops per iteration, achieving long-term statistical optimality. Intuitively, each incoming observation contributes to estimation accuracy, while aggregated summary statistics guide support recovery.
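To make the storage and per-iteration-cost claims concrete, below is a minimal Python sketch of this pattern for sparse linear regression: a single pass of SGD carries the estimation, while a running average of the iterates serves as the aggregated summary statistic guiding support recovery. The squared loss, the schedule $\eta_t = c\,t^{-\alpha}$ with $\alpha \in (1/2,1]$, and the $\tau/\sqrt{t}$ hard threshold are illustrative assumptions, not the algorithm developed in the paper.

```python
import numpy as np


def sparse_online_sgd(stream, d, c=None, alpha=0.51, tau=1.0):
    """One-pass, O(d)-storage SGD sketch for sparse linear regression.

    Maintains two length-d vectors (the iterate and its running average),
    so memory is O(d) and each update costs O(d) flops. The stepsize
    schedule eta_t = c * t^{-alpha} and the tau/sqrt(t) threshold are
    illustrative choices, not the paper's prescription.
    """
    c = 1.0 / d if c is None else c  # keep eta_t * E||x||^2 below 1 for stability
    theta = np.zeros(d)              # current SGD iterate
    theta_bar = np.zeros(d)          # aggregated summary statistic (running average)
    t = 0
    for x, y in stream:
        t += 1
        eta = c * t ** (-alpha)               # polynomially decaying stepsize
        theta -= eta * (x @ theta - y) * x    # squared-loss SGD update, O(d) flops
        theta_bar += (theta - theta_bar) / t  # online average of iterates, O(d) flops
    # Support recovery guided by the aggregated statistic: keep coordinates
    # whose averaged magnitude exceeds a horizon-dependent threshold.
    support = np.abs(theta_bar) > tau / np.sqrt(max(t, 1))
    return np.where(support, theta_bar, 0.0)


def make_stream(T, theta_star, noise=0.1, seed=0):
    """Synthetic i.i.d. stream; a dependent covariate/noise process plugs in here."""
    rng = np.random.default_rng(seed)
    for _ in range(T):
        x = rng.standard_normal(theta_star.size)
        yield x, x @ theta_star + noise * rng.standard_normal()


if __name__ == "__main__":
    d, s = 50, 3
    theta_star = np.zeros(d)
    theta_star[:s] = 1.0
    est = sparse_online_sgd(make_stream(20_000, theta_star), d)
    print("recovered support:", np.flatnonzero(est))
```

Thresholding the averaged iterate rather than the running one keeps every streaming update at exactly $O(d)$ flops and avoids zeroing coordinates before the average has stabilized; this design choice is ours, made only to keep the sketch short.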