Stochastic gradient descent (SGD) has emerged as the quintessential method in a data scientist's toolbox. However, using SGD in high-stakes applications requires careful quantification of the associated uncertainty. To that end, in this work we establish a high-dimensional Central Limit Theorem (CLT) for linear functionals of online SGD iterates in overparametrized least-squares regression with non-isotropic Gaussian inputs. Our result shows that a CLT holds even when the dimension is exponential in the number of online SGD iterations, which, to the best of our knowledge, is the first such result. To make the result usable in practice, we develop an online approach for estimating the expectation and variance terms appearing in the CLT and establish high-probability bounds for the resulting estimator. Furthermore, we propose a two-step, fully online bias-correction methodology that, together with the CLT and the variance estimator, yields a fully online, data-driven way to construct confidence intervals, thereby enabling practical high-dimensional algorithmic inference with SGD. We further extend our results to a class of single-index models based on Gaussian Stein's identity, and provide numerical simulations that corroborate our theoretical findings.
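As a concrete illustration of the setting, the following is a minimal simulation sketch (not the paper's code or estimators): it runs one-pass online SGD for least-squares regression over many independent trajectories and checks that a linear functional of the final iterate is approximately Gaussian, in the spirit of the CLT above. The isotropic Gaussian design and all constants (`eta`, `T`, `n_runs`) are illustrative assumptions; the paper covers non-isotropic inputs and provides fully online mean/variance estimation and bias correction, which this sketch replaces with cross-trajectory empirical estimates.

```python
# Illustrative sketch only: online SGD for least-squares and an empirical
# Gaussianity check for a linear functional v^T theta_T of the final iterate.
# Isotropic Gaussian inputs and all constants are assumptions for this demo.
import numpy as np

rng = np.random.default_rng(0)
d, T, eta, n_runs = 20, 2000, 0.02, 500      # dimension, iterations, step size, trajectories
theta_star = rng.normal(size=d) / np.sqrt(d)  # ground-truth regression vector
v = np.zeros(d)
v[0] = 1.0                                    # linear functional: first coordinate

finals = np.empty(n_runs)
for r in range(n_runs):
    theta = np.zeros(d)
    for _ in range(T):
        x = rng.normal(size=d)                # fresh Gaussian input (one-pass/online)
        y = x @ theta_star + rng.normal()     # noisy linear response
        theta -= eta * (x @ theta - y) * x    # online SGD step for squared loss
    finals[r] = v @ theta

# Standardize v^T theta_T across independent runs; under an (approximate) CLT,
# roughly 95% of standardized values should fall within +/- 1.96.
z = (finals - finals.mean()) / finals.std()
coverage = np.mean(np.abs(z) <= 1.96)
print(f"fraction within +/-1.96: {coverage:.3f} (approx 0.95 if Gaussian)")
```

The cross-trajectory mean and standard deviation above stand in for the paper's online, single-trajectory estimators; in the fully online regime one would instead form the confidence interval from the estimated expectation and variance terms along a single run.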