Stochastic gradient methods are central to large-scale learning, but they treat mini-batch gradients as unbiased estimators, which classical decision theory shows are inadmissible in high dimensions. We formulate gradient computation as a high-dimensional estimation problem and introduce a framework based on Stein-rule shrinkage. We construct a gradient estimator that adaptively contracts noisy mini-batch gradients toward a stable anchor derived from historical momentum. The shrinkage intensity is set in a data-driven manner using an online estimate of the gradient noise variance, leveraging statistics already maintained by adaptive optimizers. Under a Gaussian noise model, we show that our estimator uniformly dominates the standard stochastic gradient under squared error loss and is minimax-optimal. We incorporate this estimator into the Adam optimizer, yielding SR-Adam, a practical algorithm with negligible computational overhead. Empirical evaluations on CIFAR-10 and CIFAR-100 across multiple levels of input noise show consistent improvements over Adam in the large-batch regime. Ablation studies indicate that the gains arise primarily from selectively applying shrinkage to high-dimensional convolutional layers, while indiscriminate shrinkage across all parameters degrades performance. These results illustrate that classical shrinkage principles provide a principled approach to improving stochastic gradient estimation in deep learning.
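The shrinkage step described above can be sketched with the classical positive-part James–Stein rule: the noisy mini-batch gradient is contracted toward a momentum-based anchor, with the contraction growing as the estimated noise variance dominates the distance between the two. This is a minimal illustrative sketch, not the paper's exact SR-Adam update; the function name, the per-coordinate noise-variance input `sigma2`, and the flattened-vector treatment of a layer's parameters are all assumptions for the example.

```python
import numpy as np

def stein_shrunk_gradient(g, m, sigma2):
    """Positive-part James-Stein shrinkage of a noisy gradient `g`
    toward a momentum anchor `m` (illustrative sketch only).

    g, m   : 1-D arrays of dimension d (d >= 3), e.g. a flattened
             convolutional layer's mini-batch gradient and momentum.
    sigma2 : scalar estimate of per-coordinate gradient noise variance,
             e.g. derived online from an adaptive optimizer's statistics.
    """
    d = g.size
    resid = g - m
    # Shrink toward m more strongly when the estimated noise (d-2)*sigma2
    # is large relative to ||g - m||^2; clip at 0 (positive-part rule).
    shrink = max(0.0, 1.0 - (d - 2) * sigma2 / np.dot(resid, resid))
    return m + shrink * resid
```

With `sigma2 = 0` the estimator returns the raw gradient unchanged, and as the noise estimate grows it collapses onto the momentum anchor, matching the adaptive behavior the abstract describes.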