This paper establishes a mathematical foundation for the Adam optimizer, elucidating its connection to natural gradient descent through Riemannian and information geometry. We provide an accessible, detailed analysis of the diagonal empirical Fisher information matrix (FIM) in Adam, clarifying every approximation involved and, given the limitations of the empirical FIM, advocating for log-probability loss functions over discrete distributions. Our analysis uncovers flaws in the original Adam algorithm, leading to proposed corrections: enhanced momentum calculations, adjusted bias corrections, an adaptive epsilon, and gradient clipping. We also refine the weight decay term based on our theoretical framework. Our modified algorithm, Fisher Adam (FAdam), demonstrates superior performance across diverse domains including LLM, ASR, and VQ-VAE, achieving state-of-the-art results in ASR.
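To make the listed modifications concrete, the following is a minimal, hypothetical sketch of one FAdam-style update step, assembled only from the changes the abstract names (momentum accumulated on the FIM-preconditioned natural gradient, clipping applied to that natural gradient, and weight decay preconditioned by the same diagonal empirical Fisher term). Hyperparameter values and the exact update order are illustrative assumptions, not the paper's definitive algorithm.

```python
import numpy as np

def fadam_step(w, g, m, v, t, lr=1e-2, b1=0.9, b2=0.999,
               eps=1e-8, clip=1.0, wd=0.01):
    """One illustrative FAdam-style update (sketch, not the exact paper algorithm).

    w, g : parameters and their gradient (same-shape arrays)
    m, v : first-moment and diagonal-Fisher (second-moment) accumulators
    t    : 1-indexed step count, used for bias correction
    """
    v = b2 * v + (1 - b2) * g * g            # diagonal empirical FIM estimate
    v_hat = v / (1 - b2 ** t)                # bias-corrected second moment
    gn = g / (np.sqrt(v_hat) + eps)          # natural-gradient preconditioning
    norm = np.linalg.norm(gn)
    if norm > clip:                          # clip the *natural* gradient
        gn = gn * (clip / norm)
    m = b1 * m + (1 - b1) * gn               # momentum on the natural gradient
    decay = wd * w / (np.sqrt(v_hat) + eps)  # weight decay, same preconditioner
    w = w - lr * (m + decay)
    return w, m, v

# Usage: minimize f(w) = ||w||^2 / 2, whose gradient is simply w.
w = np.array([1.0, -2.0])
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 2001):
    w, m, v = fadam_step(w, w.copy(), m, v, t)
print(np.linalg.norm(w))
```

On this toy quadratic the iterate shrinks toward the origin and then oscillates in a small neighborhood of it, as is typical for sign-like preconditioned updates with a fixed learning rate.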