Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training - 专知论文

会员服务 ·

0

优化器 · MoDELS · 缩放 · 噪声 · Weight ·

Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training

翻译：暂无翻译

Christian Brandt Thomassen

from arxiv, 20 pages, 6 figures, 4 tables. 1345 training runs total (720 + 625). Submitted for review at TMLR

We test whether the optimal learning-rate schedule depends on bit-width during from-initialisation quantisation-aware training (QAT) for sub-100M decoder language models. A 720-run factorial grid (Phase 2) over bit-width x warmdown fraction x LR magnitude x model size x seed (FP16/INT8/INT6, 15M-100M, 5 seeds) finds the optimal warmdown is 33% at every (bit-width, size) cell. The primary hypothesis -- that INT6 QAT requires a different schedule than higher-precision training -- is falsified at FP16/INT8/INT6. A 625-run follow-up (Phase 5) probes the null along five axes: optimiser (AdamW), schedule shape (cosine), training length (up to 9x more iterations), an extended size sweep (5M-350M), and an INT4 sweep from 3M to 100M. The null is robust under all three setup changes. The INT6 penalty follows a log-linear scaling law whose fit on Phase 2 predicts the five held-out Phase 5 sizes (5M, 8M, 175M, 250M, 350M) within their 95% prediction intervals (5/5). For INT4 the picture is sharper than the higher precisions: at 50M and 100M, wd33 is decisively optimal (paired z ~ 12-15, 10/10 seeds); below 50M, across the six tested sizes from 3M to 30M, no individual size shows a statistically significant schedule preference and the per-size mean penalty oscillates within seed-level noise. The boundary is therefore a transition between a noise-dominated regime below 50M and a decisive wd33 regime at and above 50M, not a clean wd10 region. A weight-to-grid-distance probe falsifies the simplest mechanism for the FP16/INT8/INT6 null result (rapid grid-snapping): pre-warmdown, INT6-QAT weights sit at essentially the same distance from the INT6 grid as FP16 weights (ratio ~ 1.04). Practical recommendation: at sub-100M scale, tune the LR schedule once at FP16 and apply unchanged to INT8/INT6 QAT; for INT4 at 50M+ use wd33; for INT4 below 50M the schedule choice is in the noise.

翻译：暂无翻译

0

相关内容

优化器

IEEE Proc.｜基于知识图谱的少样本和零样本学习综述

IEEE Proc.｜基于知识图谱的少样本和零样本学习综述

专知会员服务

49+阅读 · 2024年2月2日

李宏毅老师讲解！《AlphaTensor: 用强化学习找出更有效率的矩阵相乘，附Slides与视频

李宏毅老师讲解！《AlphaTensor: 用强化学习找出更有效率的矩阵相乘，附Slides与视频

专知会员服务

42+阅读 · 2022年10月15日

【CMU博士论文】用动态超参数优化改进深度学习训练和推理，Improving Deep Learning Training and Inference with Dynamic Hyperparameter Optimization

【CMU博士论文】用动态超参数优化改进深度学习训练和推理，Improving Deep Learning Training and Inference with Dynamic Hyperparameter Optimization

专知会员服务

55+阅读 · 2020年5月26日

【ICLR2020】用实对二进制卷积训练二进制神经网络，Training Binary Neural Networks with Real-to-Binary Convolutions

【ICLR2020】用实对二进制卷积训练二进制神经网络，Training Binary Neural Networks with Real-to-Binary Convolutions

专知会员服务

26+阅读 · 2020年3月26日

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

专知会员服务

43+阅读 · 2020年1月28日

【Google无监督大规模视觉表示迁移】Large Scale Learning of General Visual Representations for Transfer

【Google无监督大规模视觉表示迁移】Large Scale Learning of General Visual Representations for Transfer

专知会员服务

12+阅读 · 2020年1月7日

【论文】自训练噪声student模型提高ImageNet分类准确率（Self-training with Noisy Student improves ImageNet classification），谷歌研究科学家Quoc V. Le等

【论文】自训练噪声student模型提高ImageNet分类准确率（Self-training with Noisy Student improves ImageNet classification），谷歌研究科学家Quoc V. Le等

专知会员服务

24+阅读 · 2019年11月20日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

【CVPR 2019 | tutorial】通过图结构网络学习表示Learning Representations via Graph-structured Networks，圣地亚哥大学|Xiaolong Wang，英伟达|Sifei Liu

【CVPR 2019 | tutorial】通过图结构网络学习表示Learning Representations via Graph-structured Networks，圣地亚哥大学|Xiaolong Wang，英伟达|Sifei Liu

专知会员服务

19+阅读 · 2019年6月16日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

【泡泡点云时空】SpiderCNN：利用参数化卷积滤波进行点集深度学习（ECCV2018-13）

【泡泡点云时空】SpiderCNN：利用参数化卷积滤波进行点集深度学习（ECCV2018-13）

泡泡机器人SLAM

10+阅读 · 2018年11月8日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

【论文推荐】最新七篇图像分割相关论文—Attention U-Net、对抗结构匹配损失、卷积CRFs、对抗样本、弱监督分割

【论文推荐】最新七篇图像分割相关论文—Attention U-Net、对抗结构匹配损失、卷积CRFs、对抗样本、弱监督分割

专知

19+阅读 · 2018年5月31日

手把手教 | 深度学习库PyTorch（附代码）

手把手教 | 深度学习库PyTorch（附代码）

数据派THU

27+阅读 · 2018年3月15日

【论文推荐】最新5篇度量学习（Metric Learning）相关论文—人脸验证、BIER、自适应图卷积、注意力机制、单次学习

【论文推荐】最新5篇度量学习（Metric Learning）相关论文—人脸验证、BIER、自适应图卷积、注意力机制、单次学习

专知

17+阅读 · 2018年2月11日

原创 | Attention Modeling for Targeted Sentiment

原创 | Attention Modeling for Targeted Sentiment

黑龙江大学自然语言处理实验室

25+阅读 · 2017年11月5日

From Softmax to Sparsemax-ICML16（1）

From Softmax to Sparsemax-ICML16（1）

KingsGarden

74+阅读 · 2016年11月26日

大规模多视角高维图像特征提取

国家自然科学基金

5+阅读 · 2017年12月31日

测量误差数据下部分线性模型有约束统计推断理论

国家自然科学基金

2+阅读 · 2015年12月31日

模拟人眼视觉特性的高性能矢量多边形叠加分析算法研究

国家自然科学基金

0+阅读 · 2015年12月31日

分布无关的概率图模型结构学习方法的研究

国家自然科学基金

4+阅读 · 2015年12月31日

基于图的半监督学习算法研究

国家自然科学基金

5+阅读 · 2015年12月31日

M-矩阵（张量）最小特征值估计及其相关问题研究

国家自然科学基金

0+阅读 · 2015年12月31日

小快拍数下宽带信号超分辨测向性能的多元优化方法研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于深度神经网络的雷达目标高分辨距离像稳健识别方法

国家自然科学基金

6+阅读 · 2015年12月31日

面向星载综合电子设备的智能BIT关键技术研究

国家自然科学基金

1+阅读 · 2015年12月31日

超分辨率中的矩阵值算子学习问题

国家自然科学基金

1+阅读 · 2014年12月31日

Geometry-Aware Dataset Condensation for Diffusion Model Training

Arxiv

0+阅读 · 6月17日

MiniFool -- Physics-Constraint-Aware Minimizer-Based Adversarial Attacks in Deep Neural Networks

Arxiv

0+阅读 · 6月16日

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Arxiv

0+阅读 · 6月11日

Non-Parametric Dual-Manifold Mapping via 8-Bit Bounded Transformation Matrices: Challenging FP-centric Hardware Paradigms in Low-Energy AI

Arxiv

0+阅读 · 6月11日

Communication Strategy Selection for Multi-GPU 3D FDTD with Convolutional Perfectly Matched Boundary Layers

Arxiv

0+阅读 · 6月5日

Scaling Qubit Mapping and Routing With Position Graph Abstraction and Memoization

Arxiv

0+阅读 · 5月10日

Rate-optimal and computationally efficient nonparametric estimation on the circle and the sphere

Arxiv

0+阅读 · 5月6日

Large-Scale Deep Learning Optimizations: A Comprehensive Survey

Arxiv

23+阅读 · 2021年11月2日

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

Arxiv

11+阅读 · 2021年4月29日

Learning Heuristics over Large Graphs via Deep Reinforcement Learning

Arxiv

12+阅读 · 2019年3月8日

VIP会员

文章信息

相关主题

最新内容

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

专知会员服务

6+阅读 · 今天2:06

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

专知会员服务

5+阅读 · 今天1:37

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

专知会员服务

3+阅读 · 6月17日

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

专知会员服务

5+阅读 · 6月17日

学习数据的几何：形状空间分析数学综述

学习数据的几何：形状空间分析数学综述

专知会员服务

4+阅读 · 6月17日

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

专知会员服务

6+阅读 · 6月17日

定向能反无人机系统最新发展动态

定向能反无人机系统最新发展动态

专知会员服务

6+阅读 · 6月17日

从燃煤战舰到算法战争：水面指挥的永恒要求

从燃煤战舰到算法战争：水面指挥的永恒要求

专知会员服务

3+阅读 · 6月17日

《短程弹道再入飞行器拦截时间中的一项异常现象》

《短程弹道再入飞行器拦截时间中的一项异常现象》

专知会员服务

5+阅读 · 6月17日

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

专知会员服务

5+阅读 · 6月17日

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

专知会员服务

4+阅读 · 6月17日

《韩国国防政策与军备出口：韩国安全与国防政策如何塑造其国防工业与军备出口格局》最新100页报告

《韩国国防政策与军备出口：韩国安全与国防政策如何塑造其国防工业与军备出口格局》最新100页报告

专知会员服务

3+阅读 · 6月17日

ICML 2026 | VOTP：用视频基础模型与最优传输，让离线偏好强化学习只需少量反馈

ICML 2026 | VOTP：用视频基础模型与最优传输，让离线偏好强化学习只需少量反馈

专知会员服务

6+阅读 · 6月16日

多模态代码智能综述：从视觉输入到可执行代码系统

多模态代码智能综述：从视觉输入到可执行代码系统

专知会员服务

8+阅读 · 6月16日

美国马六甲“三重网”概念：安全网、威慑网与杀伤网

美国马六甲“三重网”概念：安全网、威慑网与杀伤网

专知会员服务

6+阅读 · 6月16日

相关VIP内容

IEEE Proc.｜基于知识图谱的少样本和零样本学习综述

IEEE Proc.｜基于知识图谱的少样本和零样本学习综述

专知会员服务

49+阅读 · 2024年2月2日

李宏毅老师讲解！《AlphaTensor: 用强化学习找出更有效率的矩阵相乘，附Slides与视频

李宏毅老师讲解！《AlphaTensor: 用强化学习找出更有效率的矩阵相乘，附Slides与视频

专知会员服务

42+阅读 · 2022年10月15日

【CMU博士论文】用动态超参数优化改进深度学习训练和推理，Improving Deep Learning Training and Inference with Dynamic Hyperparameter Optimization

【CMU博士论文】用动态超参数优化改进深度学习训练和推理，Improving Deep Learning Training and Inference with Dynamic Hyperparameter Optimization

专知会员服务

55+阅读 · 2020年5月26日

【ICLR2020】用实对二进制卷积训练二进制神经网络，Training Binary Neural Networks with Real-to-Binary Convolutions

【ICLR2020】用实对二进制卷积训练二进制神经网络，Training Binary Neural Networks with Real-to-Binary Convolutions

专知会员服务

26+阅读 · 2020年3月26日

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

专知会员服务

43+阅读 · 2020年1月28日

【Google无监督大规模视觉表示迁移】Large Scale Learning of General Visual Representations for Transfer

【Google无监督大规模视觉表示迁移】Large Scale Learning of General Visual Representations for Transfer

专知会员服务

12+阅读 · 2020年1月7日

【论文】自训练噪声student模型提高ImageNet分类准确率（Self-training with Noisy Student improves ImageNet classification），谷歌研究科学家Quoc V. Le等

【论文】自训练噪声student模型提高ImageNet分类准确率（Self-training with Noisy Student improves ImageNet classification），谷歌研究科学家Quoc V. Le等

专知会员服务

24+阅读 · 2019年11月20日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

【CVPR 2019 | tutorial】通过图结构网络学习表示Learning Representations via Graph-structured Networks，圣地亚哥大学|Xiaolong Wang，英伟达|Sifei Liu

【CVPR 2019 | tutorial】通过图结构网络学习表示Learning Representations via Graph-structured Networks，圣地亚哥大学|Xiaolong Wang，英伟达|Sifei Liu

专知会员服务

19+阅读 · 2019年6月16日

热门VIP内容

开通专知VIP会员享更多权益服务

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

相关资讯

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

【泡泡点云时空】SpiderCNN：利用参数化卷积滤波进行点集深度学习（ECCV2018-13）

【泡泡点云时空】SpiderCNN：利用参数化卷积滤波进行点集深度学习（ECCV2018-13）

泡泡机器人SLAM

10+阅读 · 2018年11月8日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

【论文推荐】最新七篇图像分割相关论文—Attention U-Net、对抗结构匹配损失、卷积CRFs、对抗样本、弱监督分割

【论文推荐】最新七篇图像分割相关论文—Attention U-Net、对抗结构匹配损失、卷积CRFs、对抗样本、弱监督分割

专知

19+阅读 · 2018年5月31日

手把手教 | 深度学习库PyTorch（附代码）

手把手教 | 深度学习库PyTorch（附代码）

数据派THU

27+阅读 · 2018年3月15日

【论文推荐】最新5篇度量学习（Metric Learning）相关论文—人脸验证、BIER、自适应图卷积、注意力机制、单次学习

【论文推荐】最新5篇度量学习（Metric Learning）相关论文—人脸验证、BIER、自适应图卷积、注意力机制、单次学习

专知

17+阅读 · 2018年2月11日

原创 | Attention Modeling for Targeted Sentiment

原创 | Attention Modeling for Targeted Sentiment

黑龙江大学自然语言处理实验室

25+阅读 · 2017年11月5日

From Softmax to Sparsemax-ICML16（1）

From Softmax to Sparsemax-ICML16（1）

KingsGarden

74+阅读 · 2016年11月26日

相关论文

Geometry-Aware Dataset Condensation for Diffusion Model Training

Arxiv

0+阅读 · 6月17日

MiniFool -- Physics-Constraint-Aware Minimizer-Based Adversarial Attacks in Deep Neural Networks

Arxiv

0+阅读 · 6月16日

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Arxiv

0+阅读 · 6月11日

Non-Parametric Dual-Manifold Mapping via 8-Bit Bounded Transformation Matrices: Challenging FP-centric Hardware Paradigms in Low-Energy AI

Arxiv

0+阅读 · 6月11日

Communication Strategy Selection for Multi-GPU 3D FDTD with Convolutional Perfectly Matched Boundary Layers

Arxiv

0+阅读 · 6月5日

Scaling Qubit Mapping and Routing With Position Graph Abstraction and Memoization

Arxiv

0+阅读 · 5月10日

Rate-optimal and computationally efficient nonparametric estimation on the circle and the sphere

Arxiv

0+阅读 · 5月6日

Large-Scale Deep Learning Optimizations: A Comprehensive Survey

Arxiv

23+阅读 · 2021年11月2日

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

Arxiv

11+阅读 · 2021年4月29日

Learning Heuristics over Large Graphs via Deep Reinforcement Learning

Arxiv

12+阅读 · 2019年3月8日

相关基金

大规模多视角高维图像特征提取

国家自然科学基金

5+阅读 · 2017年12月31日

测量误差数据下部分线性模型有约束统计推断理论

国家自然科学基金

2+阅读 · 2015年12月31日

模拟人眼视觉特性的高性能矢量多边形叠加分析算法研究

国家自然科学基金

0+阅读 · 2015年12月31日

分布无关的概率图模型结构学习方法的研究

国家自然科学基金

4+阅读 · 2015年12月31日

基于图的半监督学习算法研究

国家自然科学基金

5+阅读 · 2015年12月31日

M-矩阵（张量）最小特征值估计及其相关问题研究

国家自然科学基金

0+阅读 · 2015年12月31日

小快拍数下宽带信号超分辨测向性能的多元优化方法研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于深度神经网络的雷达目标高分辨距离像稳健识别方法

国家自然科学基金

6+阅读 · 2015年12月31日

面向星载综合电子设备的智能BIT关键技术研究

国家自然科学基金

1+阅读 · 2015年12月31日

超分辨率中的矩阵值算子学习问题

国家自然科学基金

1+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员