In this paper, we identify and analyze a recurring training loss pattern, which we term the \textit{Epochal Sawtooth Effect (ESE)}, commonly observed during training with adaptive gradient-based optimizers, particularly Adam. The pattern is characterized by a sharp drop in loss at the beginning of each epoch, followed by a gradual increase, which produces a sawtooth-shaped loss curve. Through empirical observations, we show that the effect is most pronounced with Adam but persists, in a milder form, with other optimizers such as RMSProp. We explain the underlying mechanisms that lead to the Epochal Sawtooth Effect and study how the $\beta$ parameters, batch size, and data shuffling influence it. We quantify the influence of $\beta_2$ on the shape of the loss curve, showing that higher values of $\beta_2$ result in a nearly linear increase in loss within an epoch, while lower values create a concave upward trend. Our analysis reveals that this behavior stems from the adaptive learning rate controlled by the second moment estimate, with $\beta_1$ playing a minimal role when $\beta_2$ is large. To support our analysis, we replicate the phenomenon in a controlled quadratic minimization task. By incrementally solving a series of quadratic optimization problems with Adam, we demonstrate that the Epochal Sawtooth Effect can emerge even in simple optimization scenarios, which reinforces the generality of the pattern. This paper provides both theoretical insight and quantitative analysis of this ubiquitous phenomenon in modern optimization.
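The controlled quadratic experiment mentioned above can be approximated with a small, dependency-free sketch. It is not the authors' exact setup: the per-sample loss $f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2$, the random data, the function name `adam_on_quadratics`, and all default hyperparameters are assumptions chosen for illustration. The update itself is the standard Adam rule with bias correction. Whether a visible sawtooth appears will likely depend on `beta2`, `lr`, the ratio of `dim` to `n_samples`, and whether shuffling is enabled.

```python
import math
import random


def adam_on_quadratics(n_samples=32, dim=64, epochs=5, lr=1e-2,
                       beta1=0.9, beta2=0.999, eps=1e-8,
                       shuffle=True, seed=0):
    """Run Adam over per-sample quadratics f_i(x) = 0.5 * (a_i . x - b_i)^2.

    Returns a list with one entry per epoch; each entry lists the loss of
    every sample, measured just before the update that uses that sample.
    """
    rng = random.Random(seed)
    # Hypothetical "dataset": one random quadratic per sample.
    A = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

    x = [0.0] * dim  # parameters
    m = [0.0] * dim  # first moment estimate
    v = [0.0] * dim  # second moment estimate
    t = 0            # global step counter used for bias correction
    losses = []

    for _ in range(epochs):
        order = list(range(n_samples))
        if shuffle:
            rng.shuffle(order)  # reshuffle the sample order every epoch
        epoch_losses = []
        for i in order:
            residual = sum(a * xj for a, xj in zip(A[i], x)) - b[i]
            epoch_losses.append(0.5 * residual * residual)
            t += 1
            bc1 = 1.0 - beta1 ** t
            bc2 = 1.0 - beta2 ** t
            for j in range(dim):
                g = residual * A[i][j]  # gradient of f_i w.r.t. x_j
                m[j] = beta1 * m[j] + (1.0 - beta1) * g
                v[j] = beta2 * v[j] + (1.0 - beta2) * g * g
                x[j] -= lr * (m[j] / bc1) / (math.sqrt(v[j] / bc2) + eps)
        losses.append(epoch_losses)
    return losses
```

Plotting the flattened per-step losses for a few values of `beta2` (for example 0.9, 0.99, and 0.999) and with `shuffle` toggled is one way to check whether the within-epoch shape changes as described; the sketch makes no claim about what those plots will show.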