Subsampled natural gradient descent (SNG) has been used to enable high-precision scientific machine learning, but standard analyses based on stochastic preconditioning fail to provide insight into realistic small-sample settings. We overcome this limitation by instead analyzing SNG as a sketch-and-project method. Motivated by this lens, we discard the usual theoretical proxy which decouples gradients and preconditioners using two independent mini-batches, and we replace it with a new proxy based on squared volume sampling. Under this new proxy we show that the expectation of the SNG direction becomes equal to a preconditioned gradient descent step even in the presence of coupling, leading to (i) global convergence guarantees when using a single mini-batch of any size, and (ii) an explicit characterization of the convergence rate in terms of quantities related to the sketch-and-project structure. These findings in turn yield new insights into small-sample settings, for example by suggesting that the advantage of SNG over SGD is that it can more effectively exploit spectral decay in the model Jacobian. We also extend these ideas to explain a popular structured momentum scheme for SNG, known as SPRING, by showing that it arises naturally from accelerated sketch-and-project methods.
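To make the sketch-and-project view concrete, the following is a minimal sketch under strong simplifying assumptions, not the method analyzed in the paper. It assumes a linear least-squares model, so the model Jacobian is just a matrix `A`, and it draws mini-batches uniformly at random rather than by squared volume sampling. The names `sng_step` and `run_sng` and the `damping` parameter are hypothetical. In this setting, an undamped SNG step with unit step size projects the iterate onto the solution set of the sampled equations, which is the block Kaczmarz instance of sketch-and-project.

```python
import numpy as np

def sng_step(theta, A, b, batch, lr=1.0, damping=0.0):
    """One subsampled natural gradient step for 0.5 * ||A @ theta - b||^2.

    Assumption: the model is linear, so the mini-batch Jacobian is A[batch].
    The gradient and the preconditioner are built from the SAME mini-batch.
    """
    J = A[batch]                       # mini-batch Jacobian, shape (k, n)
    r = J @ theta - b[batch]           # mini-batch residual, shape (k,)
    K = J @ J.T + damping * np.eye(len(batch))  # k x k kernel matrix
    # With lr=1 and damping=0 the update equals pinv(J) @ r, i.e. a
    # projection of theta onto {x : A[batch] @ x = b[batch]}.
    return theta - lr * (J.T @ np.linalg.solve(K, r))

def run_sng(A, b, batch_size, steps, seed=0):
    """Iterate sng_step with uniformly sampled mini-batches (an assumption;
    the abstract's proxy uses squared volume sampling instead)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(A.shape[1])
    for _ in range(steps):
        batch = rng.choice(A.shape[0], size=batch_size, replace=False)
        theta = sng_step(theta, A, b, batch)
    return theta
```

On a consistent system, a single call to `sng_step` with the defaults makes the sampled residual vanish, and repeated calls drive the iterate toward the solution. This toy illustrates only the projection structure and says nothing about the convergence rates established in the paper.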