The substantial memory requirements of Large Language Models (LLMs), particularly for long-context fine-tuning, have renewed interest in CPU offloading to augment limited GPU memory. However, as context lengths grow, relying on CPU memory for intermediate states introduces a significant bottleneck: the offloaded footprint can exhaust the DRAM capacity of mainstream client platforms. To address this limitation, this work investigates the effectiveness of Compute Express Link (CXL) add-in card (AIC) memory as an extension to CPU memory, enabling larger models and longer context lengths during fine-tuning. Extensive benchmarking reveals two critical challenges. First, current deep learning frameworks such as PyTorch lack fine-grained, per-tensor control over NUMA memory allocation, exposing only coarse, process-level policies. Second, given this lack of control, when the fine-tuning footprint is offloaded across local DRAM and CXL-attached memory, naively placing optimizer data in higher-latency CXL memory substantially slows the optimizer step (e.g., a 4x slowdown once the data exceeds 20M elements). To overcome these challenges, this work introduces a PyTorch extension that enables tensor-level control over system memory placement, together with a CXL-aware memory allocator that pins latency-critical tensors in local DRAM while maximizing bandwidth by striping latency-tolerant tensors across one or more CXL devices. Evaluated on real hardware with 7B and 12B models, 4K-32K contexts, and a single GPU, our approach recovers throughput to 97-99% of the DRAM-only baseline with a single AIC and approximately 100% with two AICs, delivering up to a 21% improvement over naive interleaving while preserving DRAM-like DMA bandwidth for GPU transfers. These results show that carefully managed CXL-attached memory is a practical path to scaling long-context fine-tuning beyond local DRAM limits.
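To make the placement policy concrete: today's coarse control is process-wide (e.g., running the whole training process under numactl --membind or --interleave), whereas the allocator described above decides placement per tensor. The following is a minimal sketch of that idea in Python using ctypes and libnuma; it illustrates the technique, not the paper's actual extension. It assumes libnuma is installed and that the CXL AICs appear as CPU-less NUMA nodes; the node IDs (0 for local DRAM, "2,3" for the CXL devices) are hypothetical.

```python
# Sketch: per-tensor NUMA placement via libnuma (illustrative, not the
# paper's extension). Assumes CXL AICs are exposed as CPU-less NUMA nodes;
# node IDs below are hypothetical for this example.
import ctypes
import torch

numa = ctypes.CDLL("libnuma.so.1", use_errno=True)
numa.numa_available.restype = ctypes.c_int
numa.numa_alloc_onnode.restype = ctypes.c_void_p
numa.numa_alloc_onnode.argtypes = [ctypes.c_size_t, ctypes.c_int]
numa.numa_alloc_interleaved_subset.restype = ctypes.c_void_p
numa.numa_alloc_interleaved_subset.argtypes = [ctypes.c_size_t, ctypes.c_void_p]
numa.numa_parse_nodestring.restype = ctypes.c_void_p
numa.numa_parse_nodestring.argtypes = [ctypes.c_char_p]
assert numa.numa_available() >= 0, "NUMA is not available on this system"


def _wrap(ptr: int, nbytes: int, dtype, shape) -> torch.Tensor:
    # Expose the raw allocation as a buffer and view it as a tensor.
    # A real allocator would also call numa_free(ptr, nbytes) on release.
    buf = (ctypes.c_byte * nbytes).from_address(ptr)
    return torch.frombuffer(buf, dtype=dtype).view(shape)


def alloc_pinned_dram(shape, dtype=torch.float32, dram_node=0):
    """Latency-critical tensor: bind all pages to the local DRAM node."""
    nbytes = torch.Size(shape).numel() * torch.empty(0, dtype=dtype).element_size()
    ptr = numa.numa_alloc_onnode(nbytes, dram_node)
    assert ptr, "numa_alloc_onnode failed"
    return _wrap(ptr, nbytes, dtype, shape)


def alloc_striped_cxl(shape, dtype=torch.float32, cxl_nodes="2,3"):
    """Latency-tolerant tensor: interleave pages round-robin across the
    CXL nodes to aggregate the bandwidth of multiple AICs."""
    nbytes = torch.Size(shape).numel() * torch.empty(0, dtype=dtype).element_size()
    mask = numa.numa_parse_nodestring(cxl_nodes.encode())
    assert mask, "invalid node string"
    ptr = numa.numa_alloc_interleaved_subset(nbytes, mask)
    assert ptr, "numa_alloc_interleaved_subset failed"
    return _wrap(ptr, nbytes, dtype, shape)


# Example: keep optimizer state in DRAM, stage offloaded activations on CXL.
exp_avg = alloc_pinned_dram((4096, 4096))
activations = alloc_striped_cxl((4, 8192, 4096), dtype=torch.bfloat16)
```

The page-granularity interleaving used here mirrors the trade-off the abstract describes: striping spreads pages across AICs and aggregates their bandwidth, but every individual access still pays CXL latency, which is why the latency-critical optimizer state is pinned to local DRAM instead.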