Experimental results have shown that curriculum learning, i.e., presenting simpler examples before more complex ones, can improve the efficiency of learning. Some recent theoretical results have also shown that changing the sampling distribution can help neural networks learn parities, but the formal results hold only for large learning rates and one-step arguments. Here we show a separation result in the number of training steps with standard (bounded) learning rates on a common sample distribution: if the data distribution is a mixture of sparse and dense inputs, there exists a regime in which a 2-layer ReLU neural network trained by a curriculum noisy-GD (or SGD) algorithm that uses sparse examples first can learn parities of sufficiently large degree, while any fully connected neural network, possibly of larger width or depth, trained by noisy-GD on the unordered samples cannot learn them without additional steps. We also provide experimental results supporting the qualitative separation beyond the specific regime covered by the theoretical results.
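To make the setting concrete, the following is a minimal sketch of a sparse/dense input mixture with parity labels and a curriculum ordering that places the sparse examples first. It rests on assumptions not stated in the text: inputs are taken to lie in {-1, +1}^d, "sparse" is taken to mean that each coordinate equals -1 with a small probability `p_sparse`, and "dense" is taken to mean uniform coordinates. The function names and the parameters `rho`, `p_sparse`, and `support` are hypothetical. The sketch covers only the data and the ordering, not the noisy-GD training or the network.

```python
import numpy as np

def sample_inputs(n, d, p, rng):
    """Draw n inputs in {-1,+1}^d with i.i.d. coordinates, P(x_i = -1) = p."""
    return np.where(rng.random((n, d)) < p, -1, 1)

def parity(x, support):
    """Parity label: product of the coordinates indexed by `support`."""
    return np.prod(x[:, support], axis=1)

def mixture(n, d, rho, p_sparse, rng):
    """Each sample is sparse with probability rho, otherwise dense (uniform).

    Assumption: 'sparse' means few coordinates equal -1 (small p_sparse).
    """
    is_sparse = rng.random(n) < rho
    x = sample_inputs(n, d, 0.5, rng)  # dense (uniform) by default
    x[is_sparse] = sample_inputs(int(is_sparse.sum()), d, p_sparse, rng)
    return x, is_sparse

def curriculum_order(is_sparse):
    """Indices that list all sparse samples first, then all dense samples."""
    # False sorts before True, so negating the mask puts sparse samples first.
    return np.argsort(~is_sparse, kind="stable")

rng = np.random.default_rng(0)
x, is_sparse = mixture(n=200, d=20, rho=0.5, p_sparse=0.05, rng=rng)
y = parity(x, support=[0, 1, 2])      # degree-3 parity, chosen for illustration
order = curriculum_order(is_sparse)
x_curr, y_curr = x[order], y[order]   # curriculum: sparse examples come first
```

A training loop without a curriculum would iterate over `x, y` in their original (unordered) form, whereas the curriculum variant would iterate over `x_curr, y_curr`.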