Existing low-bit Microscaling (MX) formats, such as MXFP4, often suffer from substantial accuracy degradation because each block shares a single Power-of-Two scaling factor. In this work, we explore strategies that introduce minimal metadata to recover the accuracy lost during quantization while maintaining high bit efficiency across a wide range of large language models. We propose a complete algorithm-hardware co-design based on flexible metadata, featuring online quantization with a simple encoding scheme. To support the proposed method efficiently, we implement a lightweight hardware unit and integrate it into the accelerator. Evaluation results demonstrate that our method substantially narrows the accuracy gap, achieving on average a 70.63% reduction in accuracy loss compared to MXFP4 and a 37.30% reduction relative to the latest NVFP4 on LLM benchmarks. Furthermore, our design delivers up to 1.91$\times$ speedup and 1.75$\times$ energy savings over state-of-the-art accelerators. Our code is available at https://github.com/SJTU-ReArch-Group/M2XFP_ASPLOS26.
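To illustrate the failure mode the abstract refers to, the sketch below shows generic MX-style block quantization with a single shared power-of-two scale. This is a minimal illustration under simplifying assumptions, not the paper's method: `mx_quantize_block` and `mantissa_levels` are hypothetical names, and the coarse rounding grid merely stands in for the actual FP4 value set.

```python
import numpy as np

def mx_quantize_block(block, mantissa_levels=6):
    # Hypothetical sketch (not the paper's implementation): quantize a
    # block with one shared power-of-two scale, as in MX-style formats.
    amax = np.max(np.abs(block))
    if amax == 0:
        return np.zeros_like(block), 0
    # The shared scale is restricted to a power of two, so it can only
    # approximate the block's true dynamic range coarsely.
    exp = int(np.floor(np.log2(amax)))
    scale = 2.0 ** exp
    scaled = block / scale
    # Round to a coarse uniform grid standing in for the low-bit values;
    # small-magnitude elements lose the most precision here.
    q = np.round(scaled * mantissa_levels) / mantissa_levels
    return q * scale, exp
```

Because every element in the block is forced through the same power-of-two scale, blocks mixing large and small magnitudes lose precision on the small values, which is the accuracy gap the proposed metadata aims to close.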