Approximate Natural Gradient Descent (NGD) methods are an important family of optimisers for deep learning models, which use approximate Fisher information matrices to pre-condition gradients during training. The empirical Fisher (EF) method approximates the Fisher information matrix empirically by reusing the per-sample gradients collected during back-propagation. Despite its ease of implementation, the EF approximation has theoretical and practical limitations. This paper first investigates the inversely-scaled projection issue of EF, which is shown to be a major cause of its poor empirical approximation quality. An improved empirical Fisher (iEF) method, motivated as a generalised NGD method from a loss-reduction perspective, is proposed to address this issue while retaining the practical convenience of EF. The exact iEF and EF methods are experimentally evaluated in practical deep learning setups, including widely-used setups for parameter-efficient fine-tuning of pre-trained models (T5-base with LoRA and Prompt-Tuning on GLUE tasks, and ViT with LoRA on CIFAR100). Optimisation experiments show that applying exact iEF as an optimiser provides strong convergence and generalisation. It achieves the best test performance and the lowest training loss for the majority of tasks, even when compared with well-tuned AdamW/Adafactor baselines. Additionally, under a novel empirical evaluation framework, the proposed iEF method shows consistently better approximation quality to exact Natural Gradient updates than both EF and the more expensive sampled Fisher (SF). Further investigation also shows that the superior approximation quality of iEF is robust to damping across tasks and training stages. Improving existing approximate NGD optimisers with iEF is expected to lead to better convergence ability and stronger robustness to the choice of damping.
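To make the baseline concrete, the following is a minimal NumPy sketch of the standard EF-preconditioned update described above: the EF matrix is formed as the average outer product of per-sample gradients, damped, and inverted against the mean gradient. This illustrates only the conventional EF approximation, not the paper's iEF method; the function name and damping value are illustrative assumptions.

```python
import numpy as np

def ef_preconditioned_update(per_sample_grads, damping=1e-3):
    """Standard empirical Fisher (EF) preconditioned update (sketch).

    per_sample_grads: (N, D) array whose rows are per-sample gradients g_i
    (the quantities collected during back-propagation, as noted above).
    The EF matrix is F_EF = (1/N) * sum_i g_i g_i^T, and the update
    solves (F_EF + damping * I) x = g_bar for the mean gradient g_bar.
    """
    G = np.asarray(per_sample_grads, dtype=float)
    n, d = G.shape
    g_bar = G.mean(axis=0)          # mean gradient over the batch
    F_ef = G.T @ G / n              # empirical Fisher, shape (D, D)
    return np.linalg.solve(F_ef + damping * np.eye(d), g_bar)
```

In practice D is far too large to materialise F_EF explicitly; approximate NGD optimisers use structured factorisations instead, but the dense form above is sufficient to convey what is being approximated.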