Widely used stochastic gradient methods for minimizing nonconvex composite objective functions require the differentiable part to be Lipschitz smooth. This requirement fails for problem classes such as quadratic inverse problems and neural network training. To address this issue, we investigate a family of stochastic Bregman proximal gradient (SBPG) methods, which require only smooth adaptivity of the differentiable part. SBPG replaces the upper quadratic approximation used in SGD with a Bregman proximity measure, yielding an approximation model that better captures the non-Lipschitz gradients of the nonconvex objective. We formulate the vanilla SBPG and establish its convergence properties in the nonconvex setting without assuming a finite-sum structure. Experimental results on quadratic inverse problems demonstrate the robustness of SBPG. Moreover, we propose a momentum-based version of SBPG (MSBPG) and prove that it has improved convergence properties. We apply MSBPG to the training of deep neural networks with a polynomial kernel function, which ensures the smooth adaptivity of the loss function. Experimental results on representative benchmarks demonstrate the effectiveness and robustness of MSBPG in training neural networks. Since the additional computational cost of MSBPG compared with SGD is negligible in large-scale optimization, MSBPG could potentially serve as a universal open-source optimizer in the future.
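To make the replacement of the quadratic model by a Bregman proximity measure concrete, the following is a minimal sketch of a single Bregman gradient step. It is not the authors' implementation. It assumes the nonsmooth part of the objective is zero and uses the quartic kernel h(x) = (1/4)||x||^4 + (1/2)||x||^2, a common choice for quadratic inverse problems; the abstract only says that a polynomial kernel is used. The function names `grad_h`, `mirror_inverse`, and `bregman_step` are hypothetical. Under these assumptions the update solves grad_h(x_new) = grad_h(x) - step * g, where g is a stochastic gradient, and this reduces to finding the root of a scalar cubic.

```python
def grad_h(x):
    """Gradient of the assumed kernel h(x) = 0.25*||x||^4 + 0.5*||x||^2."""
    sq_norm = sum(v * v for v in x)
    return [(sq_norm + 1.0) * v for v in x]


def mirror_inverse(p, iters=100):
    """Solve grad_h(y) = p for y.

    Writing y = t * p turns the equation into the scalar cubic
    c*t^3 + t - 1 = 0 with c = ||p||^2. The left-hand side is increasing
    in t, negative at t = 0 and nonnegative at t = 1, so bisection on
    [0, 1] finds its unique root.
    """
    c = sum(v * v for v in p)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if c * mid ** 3 + mid - 1.0 > 0.0:
            hi = mid
        else:
            lo = mid
    t = 0.5 * (lo + hi)
    return [t * v for v in p]


def bregman_step(x, g, step):
    """One Bregman gradient step: grad_h(x_new) = grad_h(x) - step * g."""
    p = [a - step * b for a, b in zip(grad_h(x), g)]
    return mirror_inverse(p)


# Usage: one step from x along a (here hand-picked) stochastic gradient g.
x_new = bregman_step([1.0, -2.0], [0.5, 3.0], step=0.1)
```

With the Euclidean kernel h(x) = (1/2)||x||^2 the same update reduces to the plain SGD step x - step * g. The only extra work in this sketch is the scalar root-finding, which does not grow with the dimension of x beyond the norm computation.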