Delayed loss spikes have been reported in neural-network training, but existing theory mainly explains early non-monotone behavior caused by overly large fixed learning rates. We study one stylized hypothesis: normalization can postpone instability by gradually increasing the effective learning rate during otherwise stable descent. To test this hypothesis at the theorem level, we analyze batch-normalized linear models. Our flagship result concerns whitened square-loss linear regression, for which we derive explicit no-rising-edge and delayed-onset conditions, bound the waiting time until directional onset, and show that the rising edge self-stabilizes within finitely many iterations. Combined with a square-loss decomposition, this yields a concrete delayed-spike mechanism in the whitened regime. For logistic regression, under highly restrictive active-margin assumptions, we prove only a supporting finite-horizon directional precursor in a knife-edge regime, together with a loss lower bound, deferred to the appendix, that requires an extra non-degeneracy condition. The paper should therefore be read as a stylized mechanism study rather than a general explanation of neural-network loss spikes. Within that scope, the results isolate one concrete delayed-instability pathway induced by batch normalization.
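The hypothesized mechanism can be illustrated numerically. Below is a minimal sketch, assuming a batch-normalized linear model f(x) = gamma * w^T x / ||w|| on whitened inputs, trained by gradient descent with weight decay; the constants (eta, lam, the initialization scale) and the weight-decay-driven norm decay are illustrative assumptions for this sketch, not the paper's formal setup. The point it shows: while descent is stable the residual gradient shrinks, weight decay then dominates the norm dynamics, ||w|| decays, and the effective learning rate eta * gamma^2 / ||w||^2 on the weight direction drifts upward, which is the delayed-onset precondition described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 512
X = rng.standard_normal((n, d))      # whitened inputs: E[x x^T] ~ I
w_star = rng.standard_normal(d)
y = X @ w_star                       # noiseless linear targets

w = rng.standard_normal(d) * 5.0     # large initial norm -> small initial effective LR
gamma = 1.0
eta, lam = 0.05, 1e-2                # step size and weight decay (illustrative)

for t in range(4001):
    u = w / np.linalg.norm(w)        # BN reparameterization: f(x) = gamma * u^T x
    r = gamma * (X @ u) - y          # residuals
    g_v = (X.T @ r) / n              # gradient w.r.t. the effective weights v = gamma * u
    # Chain rule through the normalization: the w-gradient is orthogonal to w.
    g_w = (gamma / np.linalg.norm(w)) * (g_v - (g_v @ u) * u)
    g_gamma = (r @ (X @ u)) / n
    w = w - eta * g_w - eta * lam * w    # weight decay shrinks ||w|| once g_w is small
    gamma = gamma - eta * g_gamma
    if t % 500 == 0:
        loss = 0.5 * np.mean(r**2)
        eff_lr = eta * gamma**2 / np.linalg.norm(w)**2  # effective LR on the direction
        print(f"t={t:5d}  loss={loss:.4e}  ||w||={np.linalg.norm(w):.3f}  eff_lr={eff_lr:.4f}")
```

In this toy run the printed eff_lr column makes the drift observable: once it crosses the stability threshold of the directional dynamics, the loss can rise after a long quiet phase, and the orthogonal gradient component then regrows ||w||, which is the self-stabilizing effect of the rising edge in the abstract's terms.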