Asynchronous stochastic gradient descent (SGD) plays a crucial role in training large-scale machine learning models. However, the generalization performance of asynchronous delayed SGD, an essential metric for assessing machine learning algorithms, has rarely been explored. Existing generalization error bounds are rather pessimistic and cannot reveal the correlation between asynchronous delays and generalization. In this paper, we derive sharper generalization error bounds for SGD with asynchronous delay $\tau$. Leveraging the generating function analysis tool, we first establish the average stability of the delayed gradient algorithm. Based on this algorithmic stability, we provide upper bounds on the generalization error of $\tilde{\mathcal{O}}(\frac{T-\tau}{n\tau})$ and $\tilde{\mathcal{O}}(\frac{1}{n})$ for quadratic convex and strongly convex problems, respectively, where $T$ is the number of iterations and $n$ is the number of training samples. Our theoretical results indicate that asynchronous delays reduce the generalization error of the delayed SGD algorithm. An analogous analysis extends to the random delay setting, and the experimental results validate our theoretical findings.
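To make the setting concrete, the following is a minimal sketch of SGD with a fixed delay $\tau$ on a least-squares (quadratic convex) objective. It assumes the common delayed update $w_{t+1} = w_t - \eta \nabla f(w_{t-\tau}; z_{i_t})$, which the abstract does not state explicitly. The function names, the step size `lr`, and the synthetic data are illustrative choices, not taken from the paper.

```python
import numpy as np


def delayed_sgd(X, y, tau, T, lr=0.01, seed=0):
    """SGD on least squares where each gradient uses the iterate from tau steps ago.

    Assumed update (not confirmed by the source):
        w_{t+1} = w_t - lr * grad f(w_{t - tau}; z_i)
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    history = [np.zeros(d)]  # history[t] stores w_t, starting from w_0 = 0
    for t in range(T):
        # Stale iterate w_{t - tau}; clamp to w_0 during the first tau steps.
        stale = history[max(t - tau, 0)]
        i = rng.integers(n)  # sample one training example uniformly
        grad = (X[i] @ stale - y[i]) * X[i]  # gradient of 0.5 * (x_i^T w - y_i)^2
        history.append(history[-1] - lr * grad)
    return history[-1]


def generalization_gap(tau, n=50, d=5, T=500, lr=0.01, seed=0):
    """Empirical test-minus-train loss for one synthetic regression problem."""
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=d)
    X_train = rng.normal(size=(n, d))
    y_train = X_train @ w_star + 0.5 * rng.normal(size=n)
    X_test = rng.normal(size=(2000, d))
    y_test = X_test @ w_star + 0.5 * rng.normal(size=2000)
    w = delayed_sgd(X_train, y_train, tau=tau, T=T, lr=lr, seed=seed)
    train_loss = 0.5 * np.mean((X_train @ w - y_train) ** 2)
    test_loss = 0.5 * np.mean((X_test @ w - y_test) ** 2)
    return test_loss - train_loss
```

Calling `generalization_gap(tau)` for several values of `tau` and averaging over seeds gives a rough empirical counterpart to the stated bounds. A single run is noisy, so it should not be read as confirming or refuting the theory.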