This paper revisits the convergence of Stochastic Mirror Descent (SMD) in the contemporary nonconvex optimization setting. Existing results for batch-free nonconvex SMD require the distance generating function (DGF) to be differentiable with a Lipschitz continuous gradient, thereby excluding important setups such as Shannon entropy. In this work, we present a new convergence analysis of nonconvex SMD that supports general DGFs, overcomes the above limitation, and relies solely on standard assumptions. Moreover, our convergence guarantee is established with respect to the Bregman Forward-Backward envelope, which is a stronger measure than the commonly used squared norm of the gradient mapping. We further extend our results to guarantee high-probability convergence under sub-Gaussian noise and global convergence under the generalized Bregman Proximal Polyak-Łojasiewicz condition. Additionally, we illustrate the advantages of our improved SMD theory in various nonconvex machine learning tasks by harnessing nonsmooth DGFs. Notably, in the context of nonconvex differentially private (DP) learning, our theory yields a simple algorithm with a (nearly) dimension-independent utility bound. For the problem of training linear neural networks, we develop provably convergent stochastic algorithms.
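To make the Shannon entropy setup mentioned above concrete, the following is a minimal sketch of a batch-free SMD iteration on the probability simplex with the negative Shannon entropy as DGF, for which the mirror step reduces to the well-known exponentiated-gradient update. It is not the algorithm or the experiments of the paper: the toy objective $f(x) = \tfrac{1}{2}\sum_i a_i x_i^2$ with indefinite coefficients, the Gaussian gradient noise, the step size, and all function names (`entropy_mirror_step`, `noisy_grad`, `run_smd`) are assumptions chosen only for illustration.

```python
import math
import random


def entropy_mirror_step(x, grad, step_size):
    """One mirror step on the simplex with negative Shannon entropy as DGF.

    Equivalent to x_new[i] proportional to x[i] * exp(-step_size * grad[i]).
    Assumes every entry of x is strictly positive.
    """
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(xi) - step_size * gi for xi, gi in zip(x, grad)]
    shift = max(logits)
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def noisy_grad(x, coeffs, noise_std, rng):
    """Stochastic gradient of the toy objective 0.5 * sum(a_i * x_i**2).

    Illustrative assumption: exact gradient plus independent Gaussian noise.
    """
    return [a * xi + rng.gauss(0.0, noise_std) for a, xi in zip(coeffs, x)]


def run_smd(coeffs, steps=200, step_size=0.1, noise_std=0.1, seed=0):
    """Run batch-free SMD (one noisy gradient per step) from the uniform point."""
    rng = random.Random(seed)
    d = len(coeffs)
    x = [1.0 / d] * d
    for _ in range(steps):
        g = noisy_grad(x, coeffs, noise_std, rng)
        x = entropy_mirror_step(x, g, step_size)
    return x


if __name__ == "__main__":
    # Mixed-sign coefficients make the toy objective nonconvex.
    final_x = run_smd([1.0, -2.0, 0.5])
    print(final_x, sum(final_x))
```

The entropy DGF is not differentiable with a Lipschitz continuous gradient on the simplex (its gradient blows up at the boundary), which is the kind of setup the abstract says earlier analyses excluded. The multiplicative form of the update keeps every iterate strictly inside the simplex without an explicit projection.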