Huawei's openPangu-Embedded-1B and openPangu-Embedded-7B are variants of the openPangu large language model designed for efficient deployment on Ascend NPUs. The 7B variant supports three distinct Chain-of-Thought (CoT) reasoning paradigms, namely slow_think, auto_think, and no_think, while the 1B variant operates exclusively in no_think mode, which uses condensed reasoning for higher efficiency. Although CoT reasoning enhances capability, generating extended reasoning traces incurs substantial memory and latency overheads, complicating practical deployment on Ascend NPUs. This paper addresses these computational constraints through low-bit quantization, which replaces FP16 computation with more efficient integer arithmetic. We introduce a unified low-bit inference framework supporting INT8 (W8A8) and W4A8 quantization, optimized specifically for openPangu-Embedded models on the Atlas A2. A comprehensive evaluation on the HumanEval and MBPP code generation benchmarks demonstrates the efficacy of this approach: INT8 quantization consistently preserves over 90\% of the FP16 baseline accuracy while achieving a 1.5x prefill speedup on the Atlas A2, and W4A8 quantization substantially reduces memory consumption at a moderate cost in accuracy. These findings indicate that low-bit quantization enables efficient CoT reasoning on Ascend NPUs while maintaining high model fidelity.
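To make the W8A8 idea concrete, the following is a minimal, framework-agnostic sketch of symmetric per-tensor INT8 quantization with INT32 accumulation and FP16 dequantization. It is illustrative only: the function names and the NumPy reference implementation are assumptions for exposition, not the paper's Ascend-optimized kernels or calibration pipeline.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: returns the INT8 tensor and its FP32 scale."""
    scale = max(float(np.max(np.abs(x))), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_matmul(x_fp16, w_fp16):
    """W8A8 GEMM sketch: quantize activations and weights to INT8,
    accumulate the product in INT32, then dequantize back to FP16."""
    qx, sx = quantize_int8(x_fp16.astype(np.float32))
    qw, sw = quantize_int8(w_fp16.astype(np.float32))
    acc = qx.astype(np.int32) @ qw.astype(np.int32)   # integer accumulation
    return (acc * (sx * sw)).astype(np.float16)        # rescale to FP16 output

# Usage: compare against the FP16 reference on random data.
x = np.random.randn(4, 64).astype(np.float16)
w = np.random.randn(64, 32).astype(np.float16)
print(np.max(np.abs(w8a8_matmul(x, w) - (x @ w))))
```

On hardware such as the Atlas A2, the INT8 matrix multiply maps to native integer GEMM units, which is the source of the prefill speedup reported above; W4A8 additionally packs weights to 4 bits before the same integer compute path.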