Training large language models with 4-bit arithmetic improves throughput and memory efficiency. Yet the limited dynamic range of FP4 increases sensitivity to outliers. While NVFP4 mitigates quantization error via hierarchical microscaling, a persistent loss gap remains compared to BF16. This study conducts a longitudinal analysis of outlier dynamics across architectures during NVFP4 pretraining, focusing on where outliers localize, why they occur, and how they evolve over time. We find that, compared with Softmax Attention (SA), Linear Attention (LA) reduces per-tensor heavy tails but still exhibits persistent block-level spikes under block quantization. Our analysis attributes outliers to specific architectural components: Softmax in SA, gating in LA, and SwiGLU in the FFN, with "post-QK" operations exhibiting the highest sensitivity to quantization. Notably, outliers evolve from transient spikes early in training into a small set of persistent hot channels (i.e., channels with persistently large magnitudes) in later stages. Based on these findings, we introduce Hot-Channel Patch (HCP), an online compensation mechanism that identifies hot channels and reinjects their residuals using hardware-efficient kernels. We then develop CHON, an NVFP4 training recipe integrating HCP with post-QK operation protection. On a GLA-1.3B model trained for 60B tokens, CHON reduces the loss gap to BF16 from 0.94% to 0.58% while maintaining downstream accuracy.
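To make the HCP idea concrete, the following is a minimal NumPy sketch of the two steps the abstract describes: simulated 4-bit block quantization, then residual reinjection on the channels with the largest running magnitudes. The quantizer here is a plain symmetric per-block stand-in, not NVFP4's actual hierarchical microscaling, and the function names (`fake_fp4_quant`, `hcp_compensate`) and the top-`k` selection rule are illustrative assumptions, not the paper's kernels.

```python
import numpy as np

def fake_fp4_quant(x, block=16):
    """Simulated symmetric 4-bit block quantization (illustrative stand-in;
    NVFP4's real scheme uses hierarchical FP8/FP32 microscaling)."""
    flat = x.reshape(-1, block)
    # one scale per block; signed 4-bit integer grid spans [-7, 7]
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(flat / scale), -7, 7)
    return (q * scale).reshape(x.shape)

def hcp_compensate(x, k=1, block=16):
    """HCP-style compensation sketch: quantize the whole tensor, identify the
    k channels with the largest mean magnitude ("hot channels"), and reinject
    their full-precision residuals."""
    xq = fake_fp4_quant(x, block)
    hot = np.argsort(np.abs(x).mean(axis=0))[-k:]   # hot-channel indices
    xq[:, hot] += x[:, hot] - xq[:, hot]            # residual reinjection
    return xq, hot
```

A usage example: inject one outlier channel into a Gaussian activation, and the compensated tensor recovers that channel exactly while all other channels keep their ordinary block-quantized values, so the mean quantization error strictly drops.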