How to Set the Learning Rate for Large-Scale Pre-training?


Yunhua Zhou, Shuhao Xing, Junhao Huang, Xipeng Qiu, Qipeng Guo

Configuring the learning rate (LR) optimally is a fundamental yet formidable challenge in large-scale pre-training. Given the stringent trade-off between training cost and model performance, the pivotal question is whether the optimal LR can be accurately extrapolated from low-cost experiments. In this paper, we formalize this investigation into two distinct research paradigms: Fitting and Transfer. Within the Fitting paradigm, we introduce a scaling law for the search factor, which reduces the search complexity from $O(n^3)$ to $O(n \cdot C_D \cdot C_\eta)$ via predictive modeling. Within the Transfer paradigm, we extend the principles of $\mu$Transfer to the Mixture of Experts (MoE) architecture, broadening its applicability to encompass model depth, weight decay, and token horizons. By pushing existing hyperparameter research to a larger scale, we conduct a comprehensive comparison between the two paradigms. Our empirical results challenge the scalability of the widely adopted $\mu$Transfer in large-scale pre-training scenarios. Furthermore, we analyze training stability and feature learning to explain why module-wise parameter tuning underperforms in large-scale settings. This work offers systematic practical guidelines and a fresh theoretical perspective for optimizing industrial-level pre-training.
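To make the idea of extrapolating the optimal LR from low-cost experiments concrete, here is a minimal sketch. The abstract does not state the functional form of the paper's scaling law, so the power law in model size used below is an assumption made purely for illustration, and it is not the authors' search-factor law. The function names, the proxy model sizes, and the synthetic "best LR" values are all hypothetical.

```python
import numpy as np


def fit_power_law(sizes, best_lrs):
    """Fit best_lr ~ a * size**b by least squares in log-log space.

    Assumed functional form for illustration only; the paper's actual
    scaling law is not given in the abstract.
    """
    slope, intercept = np.polyfit(np.log(sizes), np.log(best_lrs), deg=1)
    return float(np.exp(intercept)), float(slope)


def predict_lr(size, a, b):
    """Extrapolate the fitted power law to a new model size."""
    return a * size ** b


# Synthetic stand-ins for the best LR found by sweeping small proxy models.
# These numbers are made up; they are not results from the paper.
proxy_sizes = np.array([1e7, 3e7, 1e8, 3e8])
proxy_best_lrs = 0.5 * proxy_sizes ** -0.3

a, b = fit_power_law(proxy_sizes, proxy_best_lrs)
target_lr = predict_lr(1e10, a, b)
print(f"fitted a={a:.4f}, b={b:.4f}, extrapolated LR at 1e10 params={target_lr:.3e}")
```

Fitting in log-log space turns the power law into a straight line, so an ordinary linear least-squares fit is enough. With real sweep data the fit would be noisy, and the extrapolated value is best treated as a starting point for a narrow search rather than a final setting.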

