We study how trade-offs between the curvature of the population risk, the geometry of the gradient noise, and the choice of preconditioner affect the generalisation ability of multipass Preconditioned Stochastic Gradient Descent (PSGD). Many practical optimisation heuristics implicitly navigate this trade-off in different ways -- for instance, some aim to whiten the gradient noise, while others aim to align updates with the expected loss curvature. When the geometry of the population risk curvature does not match that of the gradient noise, an aggressive choice that improves one aspect can amplify instability along the other, leading to suboptimal statistical behaviour. In this paper we employ on-average algorithmic stability to connect the generalisation of PSGD to an effective dimension that depends on these sources of curvature. Existing techniques for the on-average stability of SGD are limited to a single pass; as our first contribution, we develop a new on-average stability analysis for multipass SGD that handles the correlations induced by data reuse. This allows us to derive excess risk bounds that depend on the effective dimension. In particular, we show that an improperly chosen preconditioner can yield a suboptimal effective-dimension dependence in both optimisation and generalisation. Finally, we complement our upper bounds with matching, instance-dependent lower bounds.
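For concreteness, one standard form of the multipass PSGD iteration discussed above can be sketched as follows; the notation here is illustrative and not fixed by the abstract, assuming a fixed preconditioning matrix $P \succ 0$, step size $\eta > 0$, loss $\ell$, and a training example $z_{i_t}$ sampled (with reuse across passes) from the training set:
$$
w_{t+1} \;=\; w_t \;-\; \eta\, P\, \nabla \ell\big(w_t;\, z_{i_t}\big).
$$
The choice $P = I$ recovers plain SGD; a $P$ adapted to the curvature of the population risk or to the covariance of the gradient noise instantiates the two heuristics contrasted above, and the mismatch between these two geometries is what drives the trade-off studied in the paper.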