Stochastic gradient methods are central to large-scale learning, yet their analysis typically treats mini-batch gradients as unbiased estimators of the population gradient. In high-dimensional settings, however, classical results from statistical decision theory show that unbiased estimators are generally inadmissible under quadratic loss, suggesting that standard stochastic gradients may be suboptimal from a risk perspective. In this work, we formulate stochastic gradient computation as a high-dimensional estimation problem and introduce a decision-theoretic framework based on Stein-rule shrinkage. We construct a shrinkage gradient estimator that adaptively contracts noisy mini-batch gradients toward a stable restricted estimator derived from historical momentum. The shrinkage intensity is determined in a data-driven manner using an online estimate of the gradient noise variance, leveraging the second-moment statistics commonly maintained by adaptive optimization methods. Under a Gaussian noise model and for dimension p ≥ 3, we show that the proposed estimator uniformly dominates the standard stochastic gradient under squared error loss and is minimax-optimal in the classical decision-theoretic sense. We further demonstrate how this estimator can be incorporated into the Adam optimizer, yielding a practical algorithm with negligible additional computational cost. Empirical evaluations on CIFAR-10 and CIFAR-100, across multiple levels of label noise, show consistent improvements over Adam in the large-batch regime. Ablation studies indicate that the gains arise primarily from selectively applying shrinkage to high-dimensional convolutional layers, while indiscriminate shrinkage across all parameters degrades performance. These results illustrate that classical shrinkage principles provide a principled and effective approach to improving stochastic gradient estimation in modern deep learning.
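The estimator described above can be illustrated with a minimal sketch. This is not the paper's exact construction: it applies the classical positive-part James-Stein rule, shrinking a noisy mini-batch gradient `g` toward a momentum anchor `m`, with `sigma2` standing in for an online per-batch estimate of gradient noise variance (e.g. derived from Adam-style second-moment statistics). The function name and signature are illustrative.

```python
import numpy as np

def shrinkage_gradient(g, m, sigma2, eps=1e-12):
    """Positive-part James-Stein shrinkage of a noisy gradient estimate.

    A hedged sketch of the idea in the abstract, not the paper's exact
    estimator: shrink the mini-batch gradient g toward a restricted
    estimator m (e.g. historical momentum), with intensity set by the
    estimated noise variance sigma2.
    """
    g = np.asarray(g, dtype=float)
    m = np.asarray(m, dtype=float)
    p = g.size
    if p < 3:  # Stein dominance over the unbiased estimator needs p >= 3
        return g
    diff = g - m
    # Shrink more aggressively when g lies close to the anchor relative
    # to its noise level; the positive part avoids over-shrinking.
    factor = max(0.0, 1.0 - (p - 2) * sigma2 / (np.dot(diff, diff) + eps))
    return m + factor * diff
```

As a sanity check of the risk-reduction claim, when the anchor coincides with the true gradient and the noise is Gaussian, the shrunken estimate has smaller squared error than the raw noisy gradient:

```python
rng = np.random.default_rng(0)
true_grad = np.zeros(100)
noisy = true_grad + rng.normal(scale=0.5, size=100)  # sigma2 = 0.25
est = shrinkage_gradient(noisy, true_grad, 0.25)
# squared error of est is below that of noisy in this setting
```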