We show that SGD, when combined with batch normalization, can exhibit undesirable training dynamics such as divergence. More precisely, we study how Single Shuffle (SS) and Random Reshuffle (RR) -- two widely used variants of SGD -- behave surprisingly differently in the presence of batch normalization: RR leads to a much more stable evolution of the training loss than SS. As a concrete example, for regression using a linear network with batch normalization, we prove that SS and RR converge to distinct global optima that are "distorted" away from the one reached by gradient descent. For classification, we characterize conditions under which training divergence can and cannot occur for SS and RR. We present explicit constructions showing that SS leads to distorted optima in regression and to divergence in classification, whereas RR avoids both distortion and divergence. We confirm these results empirically in realistic settings and conclude that the separation between SS and RR under batch normalization is relevant in practice.
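To make the distinction between the two shuffling schemes concrete, the following is a minimal sketch of how mini-batches are formed under SS and RR. It is an illustration only, not the authors' code: the function name `epoch_batches` and its parameters (`scheme`, `seed`, and so on) are hypothetical. Under SS the data is permuted once and the same batches recur in every epoch, so each example is always normalized together with the same batch companions. Under RR a fresh permutation is drawn at the start of every epoch.

```python
import random


def epoch_batches(n, batch_size, epochs, scheme, seed=0):
    """Return, for each epoch, the list of mini-batches (tuples of indices).

    scheme="SS": Single Shuffle, one permutation reused in every epoch.
    scheme="RR": Random Reshuffle, a new permutation drawn per epoch.
    All names here are illustrative assumptions, not taken from the paper.
    """
    if scheme not in ("SS", "RR"):
        raise ValueError("scheme must be 'SS' or 'RR'")
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)  # initial shuffle, used by both schemes
    schedule = []
    for _ in range(epochs):
        if scheme == "RR":
            rng.shuffle(perm)  # reshuffle before every epoch
        batches = [tuple(perm[i:i + batch_size])
                   for i in range(0, n, batch_size)]
        schedule.append(batches)
    return schedule


ss = epoch_batches(n=8, batch_size=2, epochs=3, scheme="SS")
rr = epoch_batches(n=8, batch_size=2, epochs=3, scheme="RR")
```

In this sketch `ss[0] == ss[1] == ss[2]`, so the batch-level statistics that batch normalization computes see the same groupings every epoch, while the groupings in `rr` generally change from one epoch to the next.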