Widely used stochastic gradient methods for minimizing nonconvex composite objective functions require the differentiable part to be Lipschitz smooth. However, this requirement fails for problem classes such as quadratic inverse problems and neural network training. To address this issue, we investigate a family of stochastic Bregman proximal gradient (SBPG) methods, which require only smooth adaptivity of the differentiable part. SBPG replaces the upper quadratic approximation used in SGD with a Bregman proximity measure, yielding a better approximation model that captures the non-Lipschitz gradients of the nonconvex objective. We formulate the vanilla SBPG and establish its convergence properties in the nonconvex setting without assuming a finite-sum structure. Experimental results on quadratic inverse problems demonstrate the robustness of SBPG. Moreover, we propose a momentum-based version of SBPG (MSBPG) and prove that it has improved convergence properties. We apply MSBPG to the training of deep neural networks with a polynomial kernel function, which ensures the smooth adaptivity of the loss function. Experimental results on representative benchmarks demonstrate the effectiveness and robustness of MSBPG in training neural networks. Since the additional computational cost of MSBPG compared with SGD is negligible in large-scale optimization, MSBPG can potentially be employed as a universal open-source optimizer in the future.
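To make the Bregman step concrete, the following is a minimal sketch of one stochastic Bregman gradient update, not the authors' implementation. It assumes the quartic kernel h(x) = ||x||^4/4 + ||x||^2/2 (the abstract only says the kernel is polynomial), no nonsmooth term, and a toy quadratic inverse problem with synthetic data. The names `grad_h`, `bregman_step`, and `minibatch_grad` are hypothetical. Under these assumptions the update solves ∇h(x_new) = ∇h(x) − lr·g, which reduces to finding a scalar root of a cubic.

```python
import numpy as np

def grad_h(x):
    """Gradient of the assumed kernel h(x) = ||x||^4 / 4 + ||x||^2 / 2."""
    return (np.dot(x, x) + 1.0) * x

def bregman_step(x, g, lr):
    """Solve grad_h(x_new) = grad_h(x) - lr * g for x_new."""
    p = grad_h(x) - lr * g
    c = np.dot(p, p)
    # x_new = t * p, where t in (0, 1] is the unique root of
    # c * t^3 + t - 1 = 0; the left side is increasing in t, so bisect.
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if c * mid**3 + mid - 1.0 > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi) * p

# Toy quadratic inverse problem (synthetic, for illustration only):
# f(x) = mean_i ((a_i^T x)^2 - b_i)^2 / 4.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))
b = (A @ rng.normal(size=5)) ** 2

def minibatch_grad(x, idx):
    """Stochastic gradient of f over the sampled rows idx."""
    Ax = A[idx] @ x
    return A[idx].T @ ((Ax**2 - b[idx]) * Ax) / len(idx)

x = rng.normal(size=5)
for _ in range(500):
    idx = rng.choice(200, size=20, replace=False)
    x = bregman_step(x, minibatch_grad(x, idx), lr=0.01)
```

Because the gradient of f grows cubically in x, a plain SGD step with a fixed step size can overshoot when the iterate is large, whereas here the new iterate scales roughly like the cube root of the mirror-space vector `p`. The extra work per step is one norm computation and a scalar root-find, which is consistent with the claim that the overhead relative to SGD is small.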