Reduced-precision data formats are crucial for cost-effective serving of large language models (LLMs). While numerous reduced-precision formats have been introduced thus far, they often require intrusive modifications to software frameworks or are too unconventional for widespread adoption across hardware vendors. In this paper, we instead focus on recent industry-driven variants of block floating-point (BFP) formats and conduct a comprehensive analysis to push their limits for efficient LLM serving. Our analysis shows that existing ultra-low-bit BFP variants struggle to provide reasonable language model performance due to outlier values in blocks. To address the outliers with BFPs, we propose MX+, a cost-effective and non-intrusive extension designed for seamless integration into the microscaling (MX) formats. MX+ builds on the key insight that the outlier does not need to use its exponent field in the element data type, which allows us to repurpose the exponent field as an extended mantissa to increase the precision of the outlier element. Our evaluation shows that MX+ achieves significantly higher model performance than the 4-bit MX format (MXFP4) with negligible storage overhead and slowdown, thus offering a compelling alternative to MXFP4 or MXFP6 for efficient LLM inference.
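To make the key insight concrete, the following is a minimal, illustrative Python sketch of block quantization with a shared power-of-two scale (MX-style, using the FP4 E2M1 element grid) and of the MX+ idea: the block's max-magnitude element always lands in the top binade after scaling, so its exponent bits are redundant and can be reused as extra mantissa bits. This is an assumption-laden toy, not the paper's implementation; the block contents, scale selection, and rounding details are simplified for illustration.

```python
import math

# FP4 (E2M1) representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_mxfp4(block):
    """MX-style quantization: one shared power-of-two scale per block,
    elements rounded to the nearest FP4 (E2M1) value."""
    amax = max(abs(v) for v in block)
    # Choose the scale so amax / scale falls in FP4's top binade [4, 8).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2) if amax > 0 else 1.0
    q = []
    for v in block:
        s = abs(v) / scale
        mag = min(FP4_GRID, key=lambda g: abs(g - s))  # nearest grid point
        q.append(math.copysign(mag, v) * scale)
    return q, scale

def quantize_block_mxplus(block):
    """MX+ sketch: the max-magnitude element (the outlier) is known to lie
    in the top binade [4, 8) after scaling, so its 2 exponent bits can be
    repurposed as 2 extra mantissa bits (3 total -> spacing 0.5, not 2.0)."""
    q, scale = quantize_block_mxfp4(block)
    i = max(range(len(block)), key=lambda j: abs(block[j]))
    s = abs(block[i]) / scale
    # Extended grid in [4, 8): 4.0, 4.5, 5.0, ..., 7.5
    mag = min(max(round(s / 0.5) * 0.5, 4.0), 7.5)
    q[i] = math.copysign(mag, block[i]) * scale
    return q, scale
```

As a usage example, for a toy block `[0.1, -0.2, 0.15, 5.3]` the shared scale is 1.0; plain MXFP4 must round the outlier 5.3 to 6.0 (spacing 2.0 in the top binade), while the MX+ sketch rounds it to 5.5, cutting the outlier's quantization error without widening any other element.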