This paper studies large-batch training with the layer-wise adaptive rate scaling (LARS) family of optimizers across diverse settings and reports two findings. First, LARS algorithms with warm-up tend to become trapped in sharp minimizers early in training because of redundant ratio scaling. Second, a fixed, steep decline in the later phase limits the ability of deep neural networks to escape the sharp minimizers encountered early on. Building on these findings, we propose Time Varying LARS (TVLARS), a novel algorithm that replaces warm-up with a configurable sigmoid-like function for robust training in the initial phase. TVLARS promotes gradient exploration early on, which helps the model move past sharp minimizers, and then gradually transitions to LARS for robustness in later phases. Extensive experiments show that TVLARS outperforms LARS and LAMB in most cases, with up to 2% improvement in classification scenarios. Notably, in all self-supervised learning cases, TVLARS outperforms LARS and LAMB, with improvements of up to 10%.
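The two ingredients named in the abstract, a layer-wise ratio and a sigmoid-like time-varying factor, can be illustrated with a minimal sketch. The ratio below follows the commonly used LARS form, ||w|| / (||g|| + weight_decay * ||w||). The decay function, its parameters (`delay`, `rate`, `floor`, `alpha`), and all function names are hypothetical choices made for illustration; the abstract does not give the exact TVLARS formulation, so this should not be read as the authors' method.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_trust_ratio(weights, grads, weight_decay=0.0, eps=1e-9):
    """Layer-wise ratio ||w|| / (||g|| + weight_decay * ||w||).

    Falls back to 1.0 when either norm is zero, a common safeguard.
    """
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0
    return w_norm / (g_norm + weight_decay * w_norm + eps)


def sigmoid_like_scale(step, delay=10.0, rate=0.5, floor=1.0, alpha=1.0):
    """Hypothetical sigmoid-like factor (an assumption, not the paper's formula).

    Starts near floor + 1/alpha, passes through floor + 1/(alpha + 1)
    at step == delay, and decays monotonically toward `floor`.
    """
    # Clamp the exponent so math.exp cannot overflow for large steps.
    exponent = min(rate * (step - delay), 700.0)
    return floor + 1.0 / (alpha + math.exp(exponent))


def layer_update(weights, grads, base_lr, step):
    """One SGD-style step for a single layer with the time-varying ratio."""
    lr = base_lr * sigmoid_like_scale(step) * lars_trust_ratio(weights, grads)
    return [w - lr * g for w, g in zip(weights, grads)]
```

With these defaults the factor is larger early in training, which enlarges the effective layer-wise step, and it approaches `floor = 1.0` later, at which point the update reduces to a plain LARS-style step.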