Revisiting LARS for Large Batch Training Generalization of Neural Networks

LARS and LAMB have emerged as prominent techniques in Large Batch Learning (LBL) for ensuring training stability. Convergence stability is a challenge in LBL, where the AI agent usually gets trapped in sharp minimizers. Warm-up is an efficient technique for addressing this challenge, but it lacks a strong theoretical foundation. Specifically, the warm-up process often reduces gradients in the early phase, inadvertently preventing the agent from escaping sharp minimizers early on. In light of this, we conduct empirical experiments to analyze the behavior of LARS and LAMB with and without a warm-up strategy. Our analyses give comprehensive insight into the behavior of LARS and LAMB and into the necessity of a warm-up technique in LBL, including an explanation of why they fail in many cases. Building on these insights, we propose a novel algorithm called Time Varying LARS (TVLARS), which facilitates robust training in the initial phase without the need for warm-up. TVLARS employs a configurable sigmoid-like function in place of the warm-up process to enhance training stability. Moreover, TVLARS stimulates gradient exploration in the early phase, allowing it to escape sharp minimizers early on, and then gradually transitions to LARS, achieving the robustness of LARS in the later phases. Extensive experimental evaluations reveal that TVLARS outperforms LARS and LAMB in most cases, with improvements of up to 2% in classification scenarios. Notably, in every self-supervised learning case, TVLARS dominates LARS and LAMB, with performance improvements of up to 10%.
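The idea of replacing warm-up with a sigmoid-like, time-varying factor can be illustrated with a minimal sketch. The abstract does not give the TVLARS formula, so everything below is an assumption made for illustration: the function names, the parameters `delay`, `rate`, `floor`, and `peak`, and the specific logistic form are all hypothetical. The only part taken from standard practice is the LARS layer-wise ratio, the weight norm divided by the gradient norm plus a weight-decay term.

```python
import math


def lars_trust_ratio(weight_norm, grad_norm, weight_decay=0.0, eps=1e-12):
    """Standard LARS layer-wise ratio: ||w|| / (||g|| + wd * ||w||)."""
    denom = grad_norm + weight_decay * weight_norm
    if weight_norm == 0.0 or denom == 0.0:
        # Fall back to a neutral ratio when either norm vanishes.
        return 1.0
    return weight_norm / (denom + eps)


def sigmoid_like_scale(step, delay=100, rate=0.05, floor=1.0, peak=5.0):
    """Hypothetical decreasing sigmoid-like factor (NOT the paper's formula).

    Starts near `peak` to encourage early exploration, passes through the
    midpoint at `step == delay`, and decays towards `floor`, at which point
    the update reduces to plain LARS scaling.
    """
    z = rate * (step - delay)
    # Numerically stable evaluation of the logistic function at -z.
    if z >= 0:
        e = math.exp(-z)
        s = e / (1.0 + e)
    else:
        s = 1.0 / (1.0 + math.exp(z))
    return floor + (peak - floor) * s


def layer_learning_rate(base_lr, step, weight_norm, grad_norm):
    """Per-layer learning rate: base LR x time-varying factor x LARS ratio."""
    return base_lr * sigmoid_like_scale(step) * lars_trust_ratio(weight_norm, grad_norm)
```

Under these assumed settings the factor is largest at the start of training and settles to 1 later on, which mirrors the described behavior of exploring early and then transitioning to LARS. In contrast, a warm-up schedule starts with a small multiplier and grows it.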

