Understanding how training data shape neural network predictions is a central problem in modern learning theory. In 2020, Pedro Domingos proposed an interpolation formula valid for every model learned by deterministic gradient descent. It expresses the model's prediction as an integral, along the optimization path, of a data-dependent kernel that aligns the model's gradients at the test and training data. Such a first-order characterization remains valid for models trained with batch-based stochastic optimization. In this paper, we develop second-order forms of these interpolation formulas. We show that the leading path-kernel interpolation is supplemented by a curvature-weighted interpolation term. For stochastic gradient descent, an additional sampling-induced component appears, coupling the curvature of the prediction with the covariance of mini-batch gradient noise. We also extend the representation to stochastic gradient descent with momentum, where the interpolation structure is preserved but with the weights modified by a memory-related factor. Moreover, we establish a concentration estimate for the terminal prediction, identifying the fluctuation scale around the expected second-order representation. Together, these results provide a refinement of the path-kernel interpretation of neural network prediction.
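For orientation, the first-order path-kernel formula described in the abstract can be written out as below. This is a sketch under gradient flow, the continuous-time limit of deterministic gradient descent. The symbols $f_w$, $\ell_i'$, $K_w$, and the horizon $T$ are notation assumed here for illustration and are not taken from the paper. The second-order terms are not reproduced.

```latex
% Assumed setup (illustrative notation, not the paper's):
%   model f_w(x), training set {(x_i, y_i)}_{i=1}^n,
%   total loss L(w) = \sum_i \ell(y_i, f_w(x_i)),
%   gradient flow \dot w(t) = -\nabla_w L(w(t)).
\begin{align*}
\frac{d}{dt} f_{w(t)}(x)
  &= \nabla_w f_{w(t)}(x)^\top \dot w(t)
   = -\sum_{i=1}^{n} \ell_i'(t)\, K_{w(t)}(x, x_i), \\
K_w(x, x')
  &= \nabla_w f_w(x)^\top \nabla_w f_w(x'), \qquad
\ell_i'(t) = \partial_{\hat y}\, \ell(y_i, \hat y)\Big|_{\hat y = f_{w(t)}(x_i)}, \\
f_{w(T)}(x)
  &= f_{w(0)}(x) - \int_0^T \sum_{i=1}^{n} \ell_i'(t)\, K_{w(t)}(x, x_i)\, dt .
\end{align*}
```

The first line is the chain rule applied along the optimization path, and the last line integrates it from $0$ to $T$. With a finite step size, a Taylor expansion of the prediction across each update additionally involves the Hessian of $f_w(x)$ in $w$, which is presumably where a curvature-weighted correction of the kind mentioned in the abstract enters.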