Understanding the limitations of gradient methods, and stochastic gradient descent (SGD) in particular, is a central challenge in learning theory. To that end, a commonly used tool is the Statistical Queries (SQ) framework, which studies the performance limits of algorithms that access the data only through noisy queries. However, the formal connection between the SQ framework and SGD is known to be tenuous: existing results typically rely on adversarial or specially structured gradient noise that does not reflect the noise in standard SGD, and (as we point out here) can sometimes lead to incorrect predictions. Moreover, many analyses of SGD for challenging problems rely on non-trivial algorithmic modifications, such as restricting the SGD trajectory to the sphere or using very small learning rates. To address these shortcomings, we develop a new, non-SQ framework for studying the limitations of standard (vanilla) SGD on single-index and multi-index models (namely, when the target function depends on a low-dimensional projection of the inputs). Our results apply to a broad class of settings and architectures, including (potentially deep) neural networks.
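To make the distinction concrete, the vanilla SGD update studied here and the sphere-constrained modification mentioned above can be written as follows (a minimal sketch in our own notation, not necessarily the paper's: $\widehat{L}(w_t; z_t)$ denotes the loss on the sample $z_t$ drawn at step $t$, and $\eta$ the learning rate):
\[
w_{t+1} = w_t - \eta\, \nabla \widehat{L}(w_t; z_t)
\qquad \text{(vanilla SGD)},
\]
\[
w_{t+1} = \frac{w_t - \eta\, \nabla \widehat{L}(w_t; z_t)}{\bigl\| w_t - \eta\, \nabla \widehat{L}(w_t; z_t) \bigr\|}
\qquad \text{(SGD projected back onto the unit sphere)}.
\]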
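For concreteness, the single-index and multi-index model classes can be formalized as follows (standard definitions; the notation is ours): a single-index target depends on a single direction $w \in \mathbb{R}^d$, while a multi-index target depends on a $k$-dimensional projection with $k \ll d$, for some link function $g$:
\[
f_{\mathrm{single}}(x) = g(\langle w, x \rangle), \quad w \in \mathbb{R}^d;
\qquad
f_{\mathrm{multi}}(x) = g(Wx), \quad W \in \mathbb{R}^{k \times d},\ k \ll d .
\]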