Learning with large output spaces, also referred to as extreme multilabel classification (XMC), is a setting that arises, e.g., in large-scale tagging and product-to-product recommendation, and is characterized by label counts ranging from hundreds of thousands to millions. At this scale, the linear classification head, usually only a tiny fraction of the overall model, becomes the main driver of compute and memory demand. Current state-of-the-art XMC methods predominantly rely on FP16-FP32 mixed-precision training, which we show can be unstable as well as inefficient in terms of memory usage and computational overhead. Meanwhile, existing low-precision methods typically retain higher precision for the classification layer. In this work, we propose ELMO, a pure low-precision training framework for XMC models using BFloat16 and Float8 data types. By leveraging Kahan summation and stochastic rounding, we demonstrate that XMC models can be trained effectively and entirely in Float8, without relying on single-precision master weights or tensor scaling. Low-precision training, combined with our proposed memory optimizations (gradient fusion and chunking), enables significant reductions in GPU memory usage. For example, we train a 3-million-label XMC model with only 6.6 GiB of GPU memory, compared to the 39.7 GiB required by the optimized SOTA method Renee, without compromising accuracy.
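
To make the two numerical ingredients named above concrete, the sketch below shows, assuming a PyTorch-style setup, what a Kahan-compensated update on low-precision weights and stochastic rounding to BFloat16 might look like. The function names and the plain SGD update are illustrative only and are not taken from the ELMO implementation.

```python
import torch

def stochastic_round_bf16(x: torch.Tensor) -> torch.Tensor:
    """Stochastically round an FP32 tensor to BFloat16 (illustrative helper).

    BFloat16 keeps the top 16 bits of the FP32 bit pattern; adding uniform
    noise to the lower 16 bits before truncation rounds up with probability
    proportional to the discarded fraction, keeping the cast unbiased.
    """
    bits = x.contiguous().view(torch.int32)
    noise = torch.randint(0, 1 << 16, bits.shape, dtype=torch.int32, device=x.device)
    rounded = (bits + noise) & -65536  # clear the 16 low-order bits
    return rounded.view(torch.float32).to(torch.bfloat16)

@torch.no_grad()
def kahan_sgd_step(param: torch.Tensor, grad: torch.Tensor,
                   comp: torch.Tensor, lr: float) -> None:
    """One SGD step on BFloat16 weights with Kahan (compensated) summation.

    `comp` is a low-precision buffer carrying the rounding error lost in
    earlier steps, so small updates are not swallowed by the short mantissa
    of the low-precision weights (no FP32 master copy is kept).
    """
    update = comp.float() - lr * grad.float()            # re-inject lost bits
    new_param = (param.float() + update).to(param.dtype)
    # Error = intended update minus what actually landed in the weights.
    comp.copy_((update - (new_param.float() - param.float())).to(comp.dtype))
    param.copy_(new_param)
```

Stochastic rounding keeps low-precision casts unbiased in expectation, while the compensation buffer recovers bits that rounding would otherwise discard; mechanisms of this kind are what allow training without FP32 master weights.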