Data format innovations have been critical for scaling machine learning (ML), which in turn fuels ground-breaking ML capabilities. However, even when low-precision formats are used, model weights are often stored in both high and low precision during training. Furthermore, emerging directional data formats (e.g., MX9 and MX6) can require multiple low-precision copies of the weights. To lower the memory capacity needed for weights, we explore just-in-time quantization (JIT-Q), in which only high-precision weights are stored in memory and low-precision weights are generated only when needed. To perform JIT-Q efficiently, we evaluate emerging processing-in-memory (PIM) technology for executing quantization. With PIM, quantization is offloaded to in-memory compute units, so it incurs no costly data movement and can run concurrently with accelerator computation. Our proposed PIM-offloaded quantization keeps up with GPU compute and delivers considerable capacity savings (up to 24%) at a marginal throughput loss (up to 2.4%). These memory capacity savings can unlock several benefits, such as fitting larger models in the same system, reducing model-parallelism requirements, and improving overall ML training efficiency.
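The JIT-Q idea can be illustrated with a minimal host-side sketch. It is not the implementation evaluated in this work: it runs on the CPU rather than on PIM units, and it uses a simplified block format (signed integers sharing one power-of-two scale per block) chosen purely for illustration, not the MX9 or MX6 definitions. The names `quantize_block`, `dequantize_block`, and `JitQuantizedWeights` are hypothetical.

```python
import math


def quantize_block(block, mantissa_bits=8):
    """Quantize one block to signed integers that share a power-of-two scale.

    Returns (exponent, integers). This is an illustrative format, not MX9/MX6.
    """
    qmax = 2 ** (mantissa_bits - 1) - 1
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # Smallest exponent e such that max_abs / 2**e fits within qmax.
    exp = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exp
    # Clamp to guard against floating-point edge cases in the exponent choice.
    q = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return exp, q


def dequantize_block(exp, q):
    """Reconstruct approximate high-precision values from one quantized block."""
    scale = 2.0 ** exp
    return [v * scale for v in q]


class JitQuantizedWeights:
    """Keeps only the high-precision weights; low-precision copies are transient."""

    def __init__(self, weights, block_size=4):
        self.weights = list(weights)  # the only persistent copy
        self.block_size = block_size

    def low_precision(self, mantissa_bits):
        """Generate a low-precision copy on demand; nothing is cached."""
        return [
            quantize_block(self.weights[i:i + self.block_size], mantissa_bits)
            for i in range(0, len(self.weights), self.block_size)
        ]


if __name__ == "__main__":
    w = JitQuantizedWeights([0.5, -1.0, 0.25, 2.0, 0.0, 0.0, 0.0, 0.0])
    # Two different low-precision copies are derived from the same stored weights.
    print(w.low_precision(8))
    print(w.low_precision(4))
```

Because each call to `low_precision` derives its output from the single stored copy, several low-precision formats can be served without keeping any of them resident, which is the capacity trade-off JIT-Q targets. In this sketch the quantization cost falls on the caller; offloading that cost to in-memory compute is what the PIM approach is meant to address.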