Stochastic Gradient Descent (SGD) is arguably the most important single algorithm in modern machine learning. Although SGD with unbiased gradient estimators has been studied extensively for at least half a century, SGD variants relying on biased estimators remain comparatively rare, although interest in the topic has grown in recent years. The existing literature on SGD with biased estimators (BiasedSGD) lacks coherence: each new paper relies on a different set of assumptions, with no clear understanding of how these assumptions are connected, which may lead to confusion. We address this gap by establishing connections among the existing assumptions and presenting a comprehensive map of the underlying relationships. Additionally, we introduce a new set of assumptions that is provably weaker than all previous assumptions, and use it to present a thorough analysis of BiasedSGD in both convex and non-convex settings, offering advantages over previous results. We also provide examples where biased estimators outperform their unbiased counterparts or where unbiased versions are simply not available. Finally, we demonstrate the effectiveness of our framework through experimental results that validate our theoretical findings.