Subsampled natural gradient descent (SNG) has been used to enable high-precision scientific machine learning, but standard analyses based on stochastic preconditioning fail to provide insight into realistic small-sample settings. We overcome this limitation by instead analyzing SNG as a sketch-and-project method. Motivated by this lens, we discard the usual theoretical proxy, which decouples gradients and preconditioners using two independent mini-batches, and replace it with a new proxy based on squared volume sampling. Under this new proxy we show that the expectation of the SNG direction equals a preconditioned gradient descent step even in the presence of coupling, leading to (i) global convergence guarantees when using a single mini-batch of any size, and (ii) an explicit characterization of the convergence rate in terms of quantities related to the sketch-and-project structure. These findings in turn yield new insights into small-sample settings, for example by suggesting that the advantage of SNG over SGD is that it can more effectively exploit spectral decay in the model Jacobian. We also extend these ideas to explain a popular structured momentum scheme for SNG, known as SPRING, by showing that it arises naturally from accelerated sketch-and-project methods.
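To make the terminology concrete, here is a minimal sketch, on a toy least-squares problem, of the two objects the abstract refers to: the sketch-and-project reading of a single SNG step, and the squared-volume-sampling distribution over mini-batches. All names (J for the model Jacobian, r for the residuals, eta, the batch size b) and the specific update form are illustrative assumptions for this sketch, not the paper's exact definitions; the code does not verify the paper's expectation identity, it only computes the expected direction by enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all names illustrative, not the paper's):
# J is the model Jacobian at the current parameters, r the residual vector.
n, p, b = 6, 10, 2          # samples, parameters, mini-batch size
J = rng.standard_normal((n, p))
r = rng.standard_normal(n)
eta = 0.1

# --- Sketch-and-project reading of one SNG step on a mini-batch S ---
S = [0, 3]                   # a fixed mini-batch of rows (samples)
JS, rS = J[S], r[S]

# Projection form: the minimum-norm update delta satisfying J_S @ delta = -eta * r_S.
delta_proj, *_ = np.linalg.lstsq(JS, -eta * rS, rcond=None)

# Closed form: delta = -eta * J_S^T (J_S J_S^T)^{-1} r_S (pseudoinverse step,
# assuming J_S has full row rank).
delta_closed = -eta * JS.T @ np.linalg.solve(JS @ JS.T, rS)

assert np.allclose(delta_proj, delta_closed)

# --- Squared volume sampling over mini-batches: P(S) ∝ det(J_S J_S^T) ---
subsets = list(itertools.combinations(range(n), b))
weights = np.array([np.linalg.det(J[list(s)] @ J[list(s)].T) for s in subsets])
probs = weights / weights.sum()

# Expected SNG direction under this proxy distribution, computed exactly by
# enumeration on this toy problem (the quantity whose structure the paper
# characterizes as a preconditioned gradient).
expected_dir = sum(
    pr * J[list(s)].T @ np.linalg.solve(J[list(s)] @ J[list(s)].T, r[list(s)])
    for s, pr in zip(subsets, probs)
)
print(expected_dir[:5])
```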