Stochastic Gradient Descent (SGD) introduces anisotropic noise that is correlated with the local curvature of the loss landscape, thereby biasing optimization toward flat minima. Prior work often assumes an equivalence between the Fisher Information Matrix and the Hessian for negative log-likelihood losses, leading to the claim that the SGD noise covariance $\mathbf{C}$ is proportional to the Hessian $\mathbf{H}$. We show that this assumption holds only under restrictive conditions that are typically violated in deep neural networks. Using the recently discovered Activity--Weight Duality, we derive a more general relationship, agnostic to the specific loss formulation: $\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$, where $\mathbf{h}_p$ denotes the per-sample Hessian with $\mathbf{H} = \mathbb{E}_p[\mathbf{h}_p]$. As a consequence, $\mathbf{C}$ and $\mathbf{H}$ commute approximately rather than coincide exactly, and their diagonal elements follow an approximate power-law relation $C_{ii} \propto H_{ii}^{\gamma}$ with a theoretically bounded exponent $1 \leq \gamma \leq 2$ determined by the per-sample Hessian spectra. Experiments across datasets, architectures, and loss functions validate these bounds, providing a unified characterization of the noise-curvature relationship in deep learning.
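As a rough numerical illustration of the stated relations (a toy sketch, not the paper's experimental setup), the snippet below assumes rank-one per-sample Hessians $\mathbf{h}_p = \mathbf{g}_p\mathbf{g}_p^\top$ built from Gaussian per-sample gradients with an anisotropic spectrum; under this assumption it forms $\mathbf{H} = \mathbb{E}_p[\mathbf{h}_p]$ and $\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$, checks that the two approximately commute, and fits the diagonal exponent $\gamma$, which lands between 1 and 2. All variable names and the Gaussian rank-one model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (an assumption for illustration, not the paper's setup):
# per-sample Hessians are rank-one, h_p = g_p g_p^T, with per-sample
# gradients g_p ~ N(0, diag(lam)) drawn from an anisotropic Gaussian.
d, n = 20, 100_000
lam = np.logspace(-2, 1, d)                      # anisotropic spectrum
g = rng.standard_normal((n, d)) * np.sqrt(lam)   # per-sample gradients g_p

# H = E_p[h_p] = E_p[g_p g_p^T]
H = (g.T @ g) / n
# C proportional to E_p[h_p^2]; for rank-one h_p, h_p^2 = ||g_p||^2 g_p g_p^T
sq = (g ** 2).sum(axis=1)
C = (g.T * sq) @ g / n

# Approximate commutation: [C, H] is small relative to C H.
comm = C @ H - H @ C
print("relative commutator norm:", np.linalg.norm(comm) / np.linalg.norm(C @ H))

# Diagonal power law C_ii ∝ H_ii^gamma: fit gamma by least squares in log-log.
gamma, _ = np.polyfit(np.log(np.diag(H)), np.log(np.diag(C)), 1)
print("fitted exponent gamma:", gamma)           # falls within [1, 2] here
```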