Prior work has identified several factors that may contribute to the performance gap between Adam and SGD, including data properties, architecture design, and optimization properties. Yet these explanations are usually studied in isolation, leaving their relative importance unclear. In this work, we revisit these hypotheses through a controlled empirical study across vision, language, genomics, and graph tasks, covering modern and classical architectures under carefully designed training setups. Our results suggest that no single factor consistently explains the Adam--SGD gap. For instance, the Adam advantage can (1) persist under a uniform vocabulary distribution yet nearly disappear under a heavy-tailed one; (2) reverse in favor of SGD in softmax-attention models; and (3) grow under mild architectural modifications, e.g., when ReLU is replaced by a GeLU nonlinearity. This suggests that the gap arises from nontrivial interactions between data and architecture, rather than from a single common factor. Nevertheless, we observe a common pattern across our settings: a \emph{crossover batch size} at which the relative advantage shifts from SGD to Adam as the batch size increases. Our theoretical gap model captures these empirical results and predicts this batch-size-dependent crossover. Our perspective helps reconcile several existing hypotheses while offering practical insights across domains.