We show how SGD, when combined with batch normalization, can exhibit undesirable training dynamics such as divergence. More precisely, we study how Single Shuffle (SS) and Random Reshuffle (RR), two widely used variants of SGD, behave surprisingly differently in the presence of batch normalization: RR leads to a much more stable evolution of the training loss than SS. As a concrete example, for regression using a linear network with batch normalization, we prove that SS and RR converge to distinct global optima that are "distorted" away from the optimum of gradient descent. For classification, we then characterize conditions under which training divergence for SS and RR can and cannot occur. We present explicit constructions showing that SS leads to distorted optima in regression and to divergence in classification, whereas RR avoids both distortion and divergence. We confirm these results empirically in realistic settings and conclude that the separation between SS and RR, when used with batch normalization, is relevant in practice.
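
To make the distinction between the two shuffling schemes concrete, the following is a minimal sketch, not the authors' code. It assumes the usual definitions: SS samples one permutation of the data and reuses it in every epoch, while RR samples a fresh permutation each epoch. The helper names `epoch_batches` and `batch_means` are hypothetical and chosen only for illustration. The second helper computes per-batch means, one of the statistics batch normalization derives from each mini-batch, to show that under SS these statistics repeat exactly from epoch to epoch.

```python
import numpy as np


def epoch_batches(n, batch_size, num_epochs, scheme, seed=0):
    """Return, for each epoch, a list of index arrays (one per mini-batch).

    scheme="SS": one permutation is sampled once and reused every epoch.
    scheme="RR": a fresh permutation is sampled at the start of each epoch.
    """
    if scheme not in ("SS", "RR"):
        raise ValueError("scheme must be 'SS' or 'RR'")
    rng = np.random.default_rng(seed)
    fixed_perm = rng.permutation(n)  # only used by SS
    schedule = []
    for _ in range(num_epochs):
        perm = fixed_perm if scheme == "SS" else rng.permutation(n)
        batches = [perm[i:i + batch_size] for i in range(0, n, batch_size)]
        schedule.append(batches)
    return schedule


def batch_means(X, schedule):
    """Mean of every mini-batch, shape (num_epochs, num_batches, d)."""
    return np.array([[X[b].mean(axis=0) for b in epoch] for epoch in schedule])


if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(64, 2))  # synthetic data
    ss = epoch_batches(64, 8, 3, "SS")
    rr = epoch_batches(64, 8, 3, "RR")
    # Under SS the mini-batch composition, and hence its statistics, is
    # identical in every epoch; under RR it changes between epochs.
    print("SS means repeat:", np.allclose(batch_means(X, ss)[0], batch_means(X, ss)[1]))
    print("RR means repeat:", np.allclose(batch_means(X, rr)[0], batch_means(X, rr)[1]))
```

The sketch only illustrates the data-ordering difference; it does not reproduce the distortion or divergence results, which depend on the network and loss analyzed in the paper.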