Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice and plays an important role in the generalization of modern machine learning. However, prior research has revealed instances where the generalization performance of SGD is worse than that of ridge regression due to uneven optimization along different dimensions. Preconditioning offers a natural solution to this issue by rebalancing optimization across different directions. Yet the extent to which preconditioning can enhance the generalization performance of SGD, and whether it can bridge the existing gap with ridge regression, remain unclear. In this paper, we study the generalization performance of SGD with preconditioning for the least squares problem. We make a comprehensive comparison between preconditioned SGD and (standard and preconditioned) ridge regression. Our study makes several key contributions toward understanding and improving SGD with preconditioning. First, we establish excess risk bounds (generalization performance) for preconditioned SGD and ridge regression under an arbitrary preconditioning matrix. Second, leveraging the excess risk characterization of preconditioned SGD and ridge regression, we show by construction that there exists a simple preconditioning matrix with which preconditioned SGD can outperform (standard and preconditioned) ridge regression. Finally, we show that our proposed preconditioning matrix is simple enough to allow robust estimation from finite samples while maintaining a theoretical advantage over ridge regression. Our empirical results align with our theoretical findings, collectively showcasing the enhanced regularization effect of preconditioned SGD.
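To make the two estimators being compared concrete, the following is a minimal sketch and not the construction analyzed in the paper. It assumes one-pass SGD on the squared loss with a constant step size, a fixed preconditioning matrix `G` supplied by the caller, and plain iterate averaging; the names `preconditioned_sgd`, `ridge`, `G`, `step_size`, and `lam` are illustrative choices, and the paper's actual preconditioner is not reproduced here.

```python
import numpy as np


def preconditioned_sgd(X, y, G, step_size):
    """One pass of SGD on the squared loss with a fixed preconditioner G.

    Update rule (assumed for this sketch):
        w <- w + step_size * (y_i - x_i @ w) * (G @ x_i)
    Returns the average of the n iterates produced during the pass.
    """
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for i in range(n):
        residual = y[i] - X[i] @ w          # scalar prediction error on sample i
        w = w + step_size * residual * (G @ X[i])  # preconditioned gradient step
        w_sum += w                          # accumulate iterates for averaging
    return w_sum / n


def ridge(X, y, lam):
    """Standard ridge regression: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Setting `G = np.eye(d)` recovers plain SGD under the same assumptions, so the effect of a candidate preconditioner can be checked by running both variants on the same data and comparing them against `ridge`.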