We show that, for finite-sum minimization problems, incorporating partial second-order information of the objective function can dramatically improve the robustness to mini-batch size of variance-reduced stochastic gradient methods, making them more scalable while retaining their benefits over traditional Newton-type approaches. We demonstrate this phenomenon on a prototypical stochastic second-order algorithm, called Mini-Batch Stochastic Variance-Reduced Newton ($\texttt{Mb-SVRN}$), which combines variance-reduced gradient estimates with access to an approximate Hessian oracle. In particular, we show that when the data size $n$ is sufficiently large, i.e., $n\gg \alpha^2\kappa$, where $\kappa$ is the condition number and $\alpha$ is the Hessian approximation factor, then $\texttt{Mb-SVRN}$ achieves a fast linear convergence rate that is independent of the gradient mini-batch size $b$, as long as $b$ is in the range between $1$ and $b_{\max}=O(n/(\alpha \log n))$. Only after increasing the mini-batch size past this critical point $b_{\max}$ does the method begin to transition into a standard Newton-type algorithm, which is much more sensitive to the Hessian approximation quality. We demonstrate this phenomenon empirically on benchmark optimization tasks, showing that, after tuning the step size, the convergence rate of $\texttt{Mb-SVRN}$ remains fast for a wide range of mini-batch sizes, and that the dependence of the phase transition point $b_{\max}$ on the Hessian approximation factor $\alpha$ aligns with our theoretical predictions.
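To make the update concrete, the scheme described above (an SVRG-style variance-reduced mini-batch gradient preconditioned by a subsampled approximate Hessian computed at each snapshot) might be sketched as follows for regularized logistic regression. This is a minimal illustrative sketch under our own naming and parameter choices, not the paper's reference implementation; the specific Hessian oracle and step-size schedule are assumptions.

```python
import numpy as np

def logistic_loss(w, X, y, lam):
    # finite-sum objective: mean logistic loss + ridge term
    z = y * (X @ w)
    return np.mean(np.log1p(np.exp(-z))) + 0.5 * lam * w @ w

def grad(w, X, y, lam):
    # gradient of the mean logistic loss + ridge term
    z = y * (X @ w)
    s = -y / (1.0 + np.exp(z))
    return X.T @ s / len(y) + lam * w

def mb_svrn(X, y, b=64, hess_batch=256, eta=1.0,
            outer=5, inner=100, lam=1e-2, seed=0):
    """Illustrative Mb-SVRN sketch: variance-reduced gradients
    preconditioned by a subsampled Hessian (approximate Hessian oracle)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(outer):
        w_snap = w.copy()
        full_g = grad(w_snap, X, y, lam)           # full gradient at snapshot
        # approximate Hessian oracle: subsampled Hessian at the snapshot
        idx = rng.choice(n, size=min(hess_batch, n), replace=False)
        z = y[idx] * (X[idx] @ w_snap)
        p = 1.0 / (1.0 + np.exp(-z))               # sigmoid
        D = p * (1.0 - p)
        H = (X[idx].T * D) @ X[idx] / len(idx) + lam * np.eye(d)
        for _ in range(inner):
            # variance-reduced mini-batch gradient estimate
            B = rng.choice(n, size=b, replace=False)
            g = grad(w, X[B], y[B], lam) - grad(w_snap, X[B], y[B], lam) + full_g
            # Newton-type step with the approximate Hessian
            w = w - eta * np.linalg.solve(H, g)
    return w
```

The role of the gradient mini-batch size $b$ appears in the inner loop: per the result above, the convergence rate should remain stable as $b$ varies over a wide range, with sensitivity to the Hessian approximation emerging only at very large $b$.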