Existing low-bit Microscaling (MX) formats, such as MXFP4, often suffer substantial accuracy degradation due to their use of a shared power-of-two scaling factor per block. In this work, we explore strategies that introduce minimal metadata to recover the accuracy lost during quantization while maintaining high bit efficiency across a wide range of large language models. We propose a complete algorithm-hardware co-design based on flexible metadata, featuring online quantization with a simple encoding. To support the proposed method efficiently, we implement a lightweight hardware unit and integrate it into the accelerator. Evaluation results demonstrate that our method substantially narrows the accuracy gap, achieving on average a 70.63% reduction in accuracy loss compared to MXFP4 and a 37.30% reduction relative to the latest NVFP4 on LLM benchmarks. Furthermore, our design delivers up to 1.91$\times$ speedup and 1.75$\times$ energy savings over state-of-the-art accelerators. Our code is available at https://github.com/SJTU-ReArch-Group/M2XFP_ASPLOS26.
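The shared power-of-two scaling that the abstract identifies as the source of MXFP4's accuracy loss can be illustrated with a minimal sketch of MX-style block quantization. This is an illustrative assumption, not the paper's method: it uses the E2M1 (FP4) element grid, but simplifies the block size, the E8M0 scale encoding, and rounding behavior of the real MX specification; `quantize_mx_block` and `FP4_GRID` are hypothetical names.

```python
import math

# Representable magnitudes of an FP4 E2M1 element (MXFP4's element type).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mx_block(block):
    """Quantize one block with a single shared power-of-two scale.

    Illustrative only: real MX formats use 32-element blocks and an E8M0
    scale; overflow and rounding handling here is simplified.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0 for _ in block], 1.0
    # Smallest power of two that maps the block maximum into [0, 6].
    exp = math.ceil(math.log2(amax / FP4_GRID[-1]))
    scale = 2.0 ** exp

    def nearest(v):
        # Round the scaled value to the closest FP4 magnitude, keeping sign.
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) - g))
        return math.copysign(mag, v)

    return [scale * nearest(v / scale) for v in block], scale
```

Because every element in the block is forced onto the same power-of-two scale, small-magnitude elements that share a block with a large outlier are rounded coarsely, which is the accuracy loss the proposed metadata aims to recover.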