In this paper, we provide a novel framework for analyzing the generalization error of first-order optimization algorithms for statistical learning when the gradient can only be accessed through partial observations given by an oracle. Our analysis relies on the regularity of the gradient with respect to the data samples, and allows us to derive near-matching upper and lower bounds for the generalization error of multiple learning problems, including supervised learning, transfer learning, robust learning, distributed learning, and communication-efficient learning using gradient quantization. These results hold for smooth and strongly convex optimization problems, as well as for smooth non-convex optimization problems satisfying a Polyak-Lojasiewicz assumption. In particular, our upper and lower bounds depend on a novel quantity that extends the notion of conditional standard deviation and measures the extent to which the gradient can be approximated given access to the oracle. As a consequence, our analysis gives a precise meaning to the intuition that optimizing the statistical learning objective is as hard as estimating its gradient. Finally, we show that, in the case of standard supervised learning, mini-batch gradient descent with increasing batch sizes and a warm start can reach a generalization error that is optimal up to a multiplicative factor, thus motivating the use of this optimization scheme in practical applications.
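To make the last claim concrete, the following is a minimal sketch of mini-batch gradient descent with increasing batch sizes and a warm start, applied to a synthetic least-squares problem. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `minibatch_gd_increasing_batches` and `demo`, the geometric (doubling) batch schedule, the constant step size, and the choice of warm start (an ordinary least-squares fit on a small subsample).

```python
import numpy as np


def minibatch_gd_increasing_batches(X, y, w0, step_size, n_stages,
                                    initial_batch=8, growth=2, seed=0):
    """Mini-batch gradient descent on the least-squares loss.

    The batch size is multiplied by `growth` after every step (an assumed
    schedule, chosen only for illustration) and capped at the dataset size.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = w0.copy()
    batch = initial_batch
    for _ in range(n_stages):
        b = min(batch, n)
        # Sample a mini-batch without replacement.
        idx = rng.choice(n, size=b, replace=False)
        # Gradient of (1 / 2b) * ||X_b w - y_b||^2 with respect to w.
        residual = X[idx] @ w - y[idx]
        grad = X[idx].T @ residual / b
        w = w - step_size * grad
        batch *= growth
    return w


def demo():
    """Run the sketch on synthetic data; returns (w_true, w_warm, w_hat)."""
    rng = np.random.default_rng(1)
    n, d = 4096, 5
    w_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    # Warm start (assumed form): least-squares fit on the first 32 samples.
    w_warm, *_ = np.linalg.lstsq(X[:32], y[:32], rcond=None)
    w_hat = minibatch_gd_increasing_batches(
        X, y, w_warm, step_size=0.3, n_stages=10
    )
    return w_true, w_warm, w_hat


if __name__ == "__main__":
    w_true, w_warm, w_hat = demo()
    print("warm-start error:", np.linalg.norm(w_warm - w_true))
    print("final error:     ", np.linalg.norm(w_hat - w_true))
```

In this sketch, early steps use small, cheap batches while the iterate is still far from the optimum, and later steps use larger batches so that gradient noise shrinks as the iterate converges. The schedule and step size the paper actually analyzes may differ from the values chosen here.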