We introduce a clipping strategy for Stochastic Gradient Descent (SGD) that uses quantiles of the gradient norm as clipping thresholds. We prove that this strategy yields a robust and efficient optimization algorithm for smooth objectives (convex or non-convex) that tolerates heavy-tailed samples (including infinite variance) and a fraction of outliers in the data stream, akin to Huber contamination. Our mathematical analysis leverages the connection between constant-step-size SGD and Markov chains and handles the bias introduced by clipping in an original way. For strongly convex objectives, we prove that the iterates converge to a concentrated distribution and derive high-probability bounds on the final estimation error. In the non-convex case, we prove that the limit distribution is localized on a neighborhood with low gradient. We propose an implementation of this algorithm using rolling quantiles, which leads to a highly efficient optimization procedure with strong robustness properties, as confirmed by our numerical experiments.
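To make the clipping rule concrete, the following is a minimal Python sketch of constant-step-size SGD in which the clipping threshold is a rolling quantile of recently observed gradient norms. It is an illustration under stated assumptions, not the implementation from the paper: the function names (`quantile_clipped_sgd`, `mean_gradient`), the default parameters (`step_size`, `quantile`, `window`), and the choice to add the current norm to the window before computing the quantile are all hypothetical.

```python
from collections import deque

import numpy as np


def quantile_clipped_sgd(grad_fn, theta0, samples,
                         step_size=0.05, quantile=0.9, window=100):
    """Constant-step-size SGD with a rolling-quantile clipping threshold.

    grad_fn(theta, x) returns a stochastic gradient at theta for sample x.
    All defaults are illustrative assumptions, not values from the paper.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    norms = deque(maxlen=window)  # rolling buffer of raw gradient norms
    for x in samples:
        g = np.asarray(grad_fn(theta, x), dtype=float)
        norm = float(np.linalg.norm(g))
        # Assumption: the current (unclipped) norm enters the buffer
        # before the threshold is computed.
        norms.append(norm)
        tau = float(np.quantile(list(norms), quantile))
        if norm > tau:
            # Rescale the gradient so that its norm equals the threshold.
            g = g * (tau / norm)
        theta = theta - step_size * g
    return theta


def mean_gradient(theta, x):
    """Gradient of 0.5 * ||theta - x||^2, i.e. streaming mean estimation."""
    return theta - x


if __name__ == "__main__":
    # Stream of clean samples equal to 1.0 with a gross outlier every 25 steps.
    stream = [np.array([1e6]) if i % 25 == 10 else np.array([1.0])
              for i in range(300)]
    print(quantile_clipped_sgd(mean_gradient, np.zeros(1), stream))
```

In this toy stream, 4% of the samples are gross outliers, so the 0.9 quantile of the window stays at the scale of the clean gradient norms and each outlier can move the iterate by at most `step_size` times that threshold. Storing raw rather than clipped norms in the buffer is a design choice made here for simplicity.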