We extend the global convergence result of Chatterjee \cite{chatterjee2022convergence} to stochastic gradient descent (SGD) for non-convex objective functions. Under minimal additional assumptions that can be realized by finitely wide neural networks, we prove that if SGD is initialized inside a local region where the \L{}ojasiewicz condition holds, then with positive probability the iterates converge to a global minimum inside this region. A key component of our proof is to ensure that the entire SGD trajectory stays inside the local region with positive probability. To this end, we assume that the SGD noise scales with the objective function, a property called machine learning noise that holds in many real examples. Furthermore, we provide a negative argument showing why bounded noise combined with Robbins-Monro type step sizes is not enough to keep this key component valid.
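The noise assumption can be illustrated with a toy experiment. The sketch below is not the paper's setting: it assumes a quadratic objective $f(x) = \tfrac{1}{2}\|x\|^2$ (which satisfies a \L{}ojasiewicz-type inequality, since $\|\nabla f(x)\|^2 = 2 f(x)$) and one possible reading of "noise scales with the objective", namely Gaussian noise whose expected squared norm equals $\sigma f(x)$. The function name `sgd_with_scaled_noise` and the parameters `step`, `sigma`, and `seed` are hypothetical choices made for illustration.

```python
import numpy as np


def sgd_with_scaled_noise(x0, step=0.1, sigma=0.5, n_steps=200, seed=0):
    """Run SGD on f(x) = 0.5 * ||x||^2 with noise that vanishes as f -> 0.

    Assumption (illustrative only): the gradient noise is Gaussian with
    E||noise||^2 = sigma * f(x), so it shrinks near the global minimum.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    values = []
    for _ in range(n_steps):
        f = 0.5 * float(x @ x)   # objective value at the current iterate
        values.append(f)
        grad = x                 # exact gradient of 0.5 * ||x||^2
        # Per-coordinate standard deviation chosen so that the total
        # expected squared norm of the noise is sigma * f.
        noise = np.sqrt(sigma * f / x.size) * rng.standard_normal(x.size)
        x = x - step * (grad + noise)
    values.append(0.5 * float(x @ x))
    return x, values


x_final, values = sgd_with_scaled_noise([1.0, -2.0])
print(f"initial f = {values[0]:.3e}, final f = {values[-1]:.3e}")
```

Because the noise variance is proportional to the objective value in this toy model, the perturbations die out as the iterates approach the minimizer, which is the intuition behind keeping the trajectory inside the local region. With noise that is merely bounded, the perturbations would not shrink in this way.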