We establish matching upper and lower generalization error bounds for mini-batch gradient descent (GD) trained with deterministic or stochastic batch selection rules that are data-independent but otherwise arbitrary. For smooth, Lipschitz loss functions that are convex, nonconvex, or strongly convex, we show that the classical upper bounds for stochastic GD (SGD) also hold verbatim for such arbitrary nonadaptive batch schedules, including all deterministic ones. Further, for convex and strongly convex losses, we prove matching lower bounds directly on the generalization error, uniform over this class of batch schedules, showing that all such batch schedules generalize optimally. Lastly, for smooth (non-Lipschitz) nonconvex losses, we show that full-batch (deterministic) GD is essentially optimal among all batch schedules in the considered class, including all stochastic ones.
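To make the notion of a data-independent (nonadaptive) batch schedule concrete, the following is a minimal sketch rather than the paper's construction. It assumes a toy one-dimensional least-squares loss, and the helper names (`round_robin_schedule`, `random_schedule`, `full_batch_schedule`, `minibatch_gd`) are hypothetical. The point it illustrates is that each schedule is fixed from the sample size, batch size, step count, and an optional seed alone, before any training example is inspected. The toy loss is not globally Lipschitz, and the sketch does not verify any bound.

```python
import random


def round_robin_schedule(n, batch_size, steps):
    """Deterministic schedule: cycle through indices 0..n-1 in a fixed order."""
    return [[(t * batch_size + j) % n for j in range(batch_size)]
            for t in range(steps)]


def random_schedule(n, batch_size, steps, seed=0):
    """Stochastic schedule: batches come from a seeded RNG, never from the data."""
    rng = random.Random(seed)
    return [rng.sample(range(n), batch_size) for _ in range(steps)]


def full_batch_schedule(n, steps):
    """Full-batch GD as a special case: every step uses all n indices."""
    return [list(range(n)) for _ in range(steps)]


def minibatch_gd(xs, ys, schedule, lr=0.1, w0=0.0):
    """Mini-batch GD on the toy loss 0.5 * (w * x - y)**2 along a fixed schedule."""
    w = w0
    for batch in schedule:
        # Average gradient of the per-example loss over the scheduled batch.
        grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
        w -= lr * grad
    return w


# Toy noiseless data generated by y = 2 * x (an assumption for illustration).
xs = [0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x for x in xs]
n = len(xs)

# The schedules are built without reference to xs or ys.
results = {
    "round_robin": minibatch_gd(xs, ys, round_robin_schedule(n, 2, 200)),
    "random": minibatch_gd(xs, ys, random_schedule(n, 2, 200, seed=0)),
    "full_batch": minibatch_gd(xs, ys, full_batch_schedule(n, 200)),
}
for name, w in results.items():
    print(name, w)
```

An adaptive rule, by contrast, would choose the next batch after looking at the examples or the current losses, which places it outside the class of schedules considered here.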