Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead rather than training time. The primary culprit is activation memory, whose footprint scales linearly with sequence length. We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier. Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which keeps the activation memory footprint constant (O(1)) and shifts the primary bottleneck to the growing KV cache. To manage the KV cache, OOMB integrates a suite of synergistic optimizations: a paged memory manager for both the KV cache and its gradients to eliminate fragmentation, asynchronous CPU offloading to hide data-transfer latency, and page-level sparse attention to reduce both computational complexity and communication overhead. Together, these techniques yield exceptional efficiency: our empirical results show that for every additional 10K tokens of context, the end-to-end training memory overhead of Qwen2.5-7B grows by a mere 10 MB. This allows Qwen2.5-7B to be trained with a 4M-token context on a single H200 GPU, a feat that would otherwise require a large cluster using context parallelism. This work represents a substantial advance in resource efficiency for long-context LLM training. The source code is available at https://github.com/wenhaoli-xmu/OOMB.
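As a rough illustration of the mechanism described above, the following PyTorch sketch pairs a chunk-recurrent training loop (each chunk's activations are recomputed during backward via `torch.utils.checkpoint`, so live activation memory stays constant in context length) with a paged KV cache that asynchronously offloads fixed-size pages to pinned CPU memory. Every name here (`PagedKVCache`, `train_step`, the model's call signature, the page and chunk sizes) is an assumption for illustration, not the OOMB API. In particular, this sketch detaches the KV cache between chunks and therefore truncates cross-chunk gradients, whereas OOMB maintains paged gradients for the KV cache.

```python
# Conceptual sketch only: names and signatures are assumptions, not OOMB's API.
import torch
from torch.utils.checkpoint import checkpoint


class PagedKVCache:
    """Fixed-size KV pages copied asynchronously to pinned host memory, so
    device-to-host transfers overlap with compute on the default stream."""

    def __init__(self, page_tokens: int = 256):
        self.page_tokens = page_tokens
        self.copy_stream = torch.cuda.Stream()  # side stream for D2H copies
        self.host_pages: list[tuple[torch.Tensor, torch.Tensor]] = []

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # Slice new keys/values into uniform pages; uniform page sizes are
        # what lets the allocator avoid fragmentation as the cache grows.
        for s in range(0, k.size(1), self.page_tokens):
            kp = k[:, s:s + self.page_tokens].contiguous()
            vp = v[:, s:s + self.page_tokens].contiguous()
            self._offload(kp, vp)

    def _offload(self, kp: torch.Tensor, vp: torch.Tensor) -> None:
        # Pinned destination buffers are required for the copy to be truly
        # asynchronous with respect to the default stream.
        hk = torch.empty(kp.shape, dtype=kp.dtype, pin_memory=True)
        hv = torch.empty(vp.shape, dtype=vp.dtype, pin_memory=True)
        self.copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.copy_stream):
            hk.copy_(kp, non_blocking=True)
            hv.copy_(vp, non_blocking=True)
            # Keep the device pages alive until the side-stream copy is done.
            kp.record_stream(self.copy_stream)
            vp.record_stream(self.copy_stream)
        # A real implementation must record/wait on a CUDA event before any
        # consumer reads these host pages; omitted here for brevity.
        self.host_pages.append((hk, hv))


def train_step(model, input_ids, labels, chunk_size: int = 4096):
    """Process one long sequence chunk by chunk: live activations never
    exceed one chunk, so activation memory is O(1) in context length."""
    cache = PagedKVCache()
    n_chunks = (input_ids.size(1) + chunk_size - 1) // chunk_size
    for i in range(n_chunks):
        s, e = i * chunk_size, min((i + 1) * chunk_size, input_ids.size(1))
        # checkpoint() discards this chunk's activations after the forward
        # pass and recomputes them on the fly during backward. The
        # (hypothetical) model is assumed to gather the KV pages it needs
        # back onto the GPU, e.g. via page-level sparse attention.
        loss, k, v = checkpoint(model, input_ids[:, s:e], labels[:, s:e],
                                cache.host_pages, use_reentrant=False)
        (loss / n_chunks).backward()  # per-chunk backward, O(1) activations
        cache.append(k.detach(), v.detach())  # simplification: no KV grads
```

The page-granular layout is also what page-level sparse attention would operate on: attending over only a subset of pages would cut both the attention FLOPs and the number of pages that must be fetched back from host memory, consistent with the compute and communication savings claimed above.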