Post-Training Quantization of OpenPangu Models for Efficient Deployment on Atlas A2

NPU · Chain-of-Thought reasoning · Chain-of-Thought · Memory · INT8

Yilun Luo, Huaqing Zheng, Haoqian Meng, Wenyuan Liu, Peng Zhang

Huawei's openPangu-Embedded-1B and openPangu-Embedded-7B are variants of the openPangu large language model, designed for efficient deployment on Ascend NPUs. The 7B variant supports three distinct Chain-of-Thought (CoT) reasoning paradigms, namely slow_think, auto_think, and no_think, while the 1B variant operates exclusively in the no_think mode, which employs condensed reasoning for higher efficiency. Although CoT reasoning enhances capability, the generation of extended reasoning traces introduces substantial memory and latency overheads, posing challenges for practical deployment on Ascend NPUs. This paper addresses these computational constraints by leveraging low-bit quantization, which transforms FP16 computations into more efficient integer arithmetic. We introduce a unified low-bit inference framework, supporting INT8 (W8A8) and W4A8 quantization, specifically optimized for openPangu-Embedded models on the Atlas A2. Our comprehensive evaluation on code generation benchmarks (HumanEval and MBPP) demonstrates the efficacy of this approach. INT8 quantization consistently preserves over 90% of the FP16 baseline accuracy and achieves a 1.5x prefill speedup on the Atlas A2. Furthermore, W4A8 quantization significantly reduces memory consumption, albeit with a moderate trade-off in accuracy. These findings collectively indicate that low-bit quantization effectively facilitates efficient CoT reasoning on Ascend NPUs, maintaining high model fidelity.
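The abstract does not describe the framework's internals, so the following is only a minimal sketch of what W8A8 and W4A8 generally mean: weights and activations are mapped to signed integers, the multiply-accumulate runs on integers, and the result is rescaled to floating point. It assumes symmetric quantization with one scale per weight row and one scale per activation vector. The function names (`quantize_symmetric`, `quantized_linear`, `reference_linear`) are hypothetical, and this is not the authors' implementation or an Ascend API.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Map floats to signed integers using one scale (symmetric, zero-point 0)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp into [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def quantized_linear(x: List[float], weight: List[List[float]],
                     w_bits: int = 8, a_bits: int = 8) -> List[float]:
    """Compute y = W @ x with integer accumulation, then rescale to float."""
    # Assumption: a single scale for the whole activation vector.
    x_q, x_scale = quantize_symmetric(x, a_bits)
    out = []
    for row in weight:
        # Assumption: one scale per output row of the weight matrix.
        w_q, w_scale = quantize_symmetric(row, w_bits)
        acc = sum(wi * xi for wi, xi in zip(w_q, x_q))  # integer multiply-accumulate
        out.append(acc * w_scale * x_scale)             # dequantize the accumulator
    return out


def reference_linear(x: List[float], weight: List[List[float]]) -> List[float]:
    """Plain floating-point y = W @ x, used as the baseline."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in weight]


if __name__ == "__main__":
    x = [0.5, -1.25, 2.0, 0.75]
    W = [[0.1, -0.2, 0.3, 0.4], [-0.5, 0.25, 0.05, -0.15]]
    print("FP reference:", reference_linear(x, W))
    print("W8A8:        ", quantized_linear(x, W, w_bits=8, a_bits=8))
    print("W4A8:        ", quantized_linear(x, W, w_bits=4, a_bits=8))
```

In this toy example the 4-bit weight setting drifts further from the floating-point result than the 8-bit setting, which matches the direction of the accuracy trade-off the abstract reports for W4A8. The sketch stores 4-bit values in ordinary Python integers, so it illustrates the arithmetic only and shows none of the memory saving.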
