We consider non-convex stochastic optimization problems in which the stochastic gradients of the objective function grow super-linearly and are discontinuous. In this setting, we provide a non-asymptotic analysis of the tamed unadjusted stochastic Langevin algorithm (TUSLA) introduced in Lovas et al. (2020). In particular, we establish non-asymptotic error bounds for TUSLA in Wasserstein-1 and Wasserstein-2 distances. The latter result enables us to further derive non-asymptotic estimates for the expected excess risk. To illustrate the applicability of the main results, we consider an example from transfer learning with ReLU neural networks, which represents a key paradigm in machine learning. Numerical experiments for this example support our theoretical findings. Hence, in this setting, we demonstrate both theoretically and numerically that TUSLA can solve optimization problems involving neural networks with ReLU activation functions. In addition, we provide simulation results for synthetic examples in which popular algorithms, e.g. ADAM, AMSGrad, RMSProp, and (vanilla) stochastic gradient descent (SGD), may fail to find the minimizer of the objective function because of the super-linear growth and the discontinuity of the corresponding stochastic gradient, while TUSLA converges rapidly to the optimal solution. Moreover, we provide an empirical comparison of the performance of TUSLA with popular stochastic optimizers on real-world datasets, and investigate the effect of the key hyperparameters of TUSLA on its performance.
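To make the idea of taming concrete, the following is a minimal sketch of one commonly stated form of the update, in which the stochastic gradient is divided by 1 + sqrt(lam) * |theta|^(2r) before a Langevin step with Gaussian noise. The exact update rule is not given in this abstract, so that form, the names `lam` (step size), `beta` (inverse temperature) and `r` (growth exponent), and the toy gradient `toy_stochastic_gradient` are all assumptions made for illustration rather than the authors' implementation.

```python
import math
import random


def tusla_step(theta, grad, lam, beta, r, rng):
    """One tamed Langevin update on a list of floats (assumed form).

    theta_new = theta - lam * grad / (1 + sqrt(lam) * |theta|^(2r))
                + sqrt(2 * lam / beta) * standard Gaussian noise
    """
    norm = math.sqrt(sum(t * t for t in theta))
    # Taming factor: grows with |theta| so the effective step stays bounded
    # relative to theta even when the gradient grows super-linearly.
    taming = 1.0 + math.sqrt(lam) * norm ** (2 * r)
    noise_scale = math.sqrt(2.0 * lam / beta)
    return [
        t - lam * g / taming + noise_scale * rng.gauss(0.0, 1.0)
        for t, g in zip(theta, grad)
    ]


def toy_stochastic_gradient(theta, x):
    """Hypothetical gradient: cubic growth plus a jump at theta = x."""
    return [t ** 3 + (1.0 if t > x else -1.0) for t in theta]


def run_tusla(theta0, n_steps, lam=0.01, beta=1e8, r=1, seed=0):
    """Iterate the tamed update on the toy problem with data x ~ N(0, 1)."""
    rng = random.Random(seed)
    theta = list(theta0)
    for _ in range(n_steps):
        x = rng.gauss(0.0, 1.0)  # one fresh data sample per iteration
        grad = toy_stochastic_gradient(theta, x)
        theta = tusla_step(theta, grad, lam, beta, r, rng)
    return theta
```

For example, `run_tusla([10.0], 2000)` starts far from the origin, where the cubic term dominates. In this toy problem the tamed step shrinks the iterate by a roughly constant factor per iteration while it is large, and behaves like an ordinary noisy gradient step once the iterate is small.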