This paper establishes a mathematical foundation for the Adam optimizer, elucidating its connection to natural gradient descent through Riemannian and information geometry. We rigorously analyze the diagonal empirical Fisher information matrix (FIM) used in Adam and make each approximation explicit. Because of the limitations of the empirical FIM, we advocate using a log-probability function, based on a discrete distribution, as the loss. Our analysis uncovers flaws in the original Adam algorithm, and we propose corrections including enhanced momentum calculations, adjusted bias corrections, and gradient clipping. We also refine the weight decay term based on our theoretical framework. The resulting algorithm, Fisher Adam (FAdam), demonstrates superior performance across diverse domains, including large language models (LLMs), automatic speech recognition (ASR), and VQ-VAE, and achieves state-of-the-art results in ASR.