The substantial memory requirements of Large Language Models (LLMs), particularly for long-context fine-tuning, have renewed interest in CPU offloading to augment limited GPU memory. However, as context lengths grow, relying on CPU memory for intermediate states introduces a significant bottleneck: the offloaded footprint can exhaust the DRAM capacity of mainstream client platforms. To address this limitation, this work investigates the effectiveness of Compute Express Link (CXL) add-in card (AIC) memory as an extension to CPU memory, enabling larger models and longer context lengths during fine-tuning. Extensive benchmarking reveals two critical challenges. First, current deep learning frameworks such as PyTorch lack fine-grained, per-tensor control over NUMA memory allocation, exposing only coarse, process-level policies. Second, given this lack of control, when the fine-tuning footprint is offloaded across local DRAM and CXL-attached memory, naively placing optimizer data in higher-latency CXL memory substantially slows the optimizer step (e.g., a 4x slowdown once the data exceeds 20M elements). To overcome these challenges, this work introduces a PyTorch extension that enables tensor-level control over system memory placement, together with a CXL-aware memory allocator that pins latency-critical tensors in local DRAM while maximizing bandwidth by striping latency-tolerant tensors across one or more CXL devices. Evaluated on real hardware with 7B and 12B models, 4K-32K contexts, and a single GPU, our approach recovers throughput to 97-99% of the DRAM-only baseline with a single AIC and approximately 100% with two AICs, delivering up to a 21% improvement over naive interleaving while preserving DRAM-like DMA bandwidth for GPU transfers. These results show that carefully managed CXL-attached memory is a practical path to scaling long-context fine-tuning beyond local DRAM limits.
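To make the placement policy concrete: today's coarse control is process-wide (e.g., running the whole training process under numactl --membind or --interleave), whereas the allocator described above decides placement per tensor. The following is a minimal sketch of that idea in Python using ctypes and libnuma; it illustrates the technique, not the paper's actual extension. It assumes libnuma is installed and that the CXL AICs appear as CPU-less NUMA nodes; the node IDs (0 for local DRAM, "2,3" for the CXL devices) are hypothetical.

```python
# Sketch: per-tensor NUMA placement via libnuma (illustrative, not the
# paper's extension). Assumes CXL AICs are exposed as CPU-less NUMA nodes;
# node IDs below are hypothetical for this example.
import ctypes
import torch

numa = ctypes.CDLL("libnuma.so.1", use_errno=True)
numa.numa_available.restype = ctypes.c_int
numa.numa_alloc_onnode.restype = ctypes.c_void_p
numa.numa_alloc_onnode.argtypes = [ctypes.c_size_t, ctypes.c_int]
numa.numa_alloc_interleaved_subset.restype = ctypes.c_void_p
numa.numa_alloc_interleaved_subset.argtypes = [ctypes.c_size_t, ctypes.c_void_p]
numa.numa_parse_nodestring.restype = ctypes.c_void_p
numa.numa_parse_nodestring.argtypes = [ctypes.c_char_p]
assert numa.numa_available() >= 0, "NUMA is not available on this system"


def _wrap(ptr: int, nbytes: int, dtype, shape) -> torch.Tensor:
    # Expose the raw allocation as a buffer and view it as a tensor.
    # A real allocator would also call numa_free(ptr, nbytes) on release.
    buf = (ctypes.c_byte * nbytes).from_address(ptr)
    return torch.frombuffer(buf, dtype=dtype).view(shape)


def alloc_pinned_dram(shape, dtype=torch.float32, dram_node=0):
    """Latency-critical tensor: bind all pages to the local DRAM node."""
    nbytes = torch.Size(shape).numel() * torch.empty(0, dtype=dtype).element_size()
    ptr = numa.numa_alloc_onnode(nbytes, dram_node)
    assert ptr, "numa_alloc_onnode failed"
    return _wrap(ptr, nbytes, dtype, shape)


def alloc_striped_cxl(shape, dtype=torch.float32, cxl_nodes="2,3"):
    """Latency-tolerant tensor: interleave pages round-robin across the
    CXL nodes to aggregate the bandwidth of multiple AICs."""
    nbytes = torch.Size(shape).numel() * torch.empty(0, dtype=dtype).element_size()
    mask = numa.numa_parse_nodestring(cxl_nodes.encode())
    assert mask, "invalid node string"
    ptr = numa.numa_alloc_interleaved_subset(nbytes, mask)
    assert ptr, "numa_alloc_interleaved_subset failed"
    return _wrap(ptr, nbytes, dtype, shape)


# Example: keep optimizer state in DRAM, stage offloaded activations on CXL.
exp_avg = alloc_pinned_dram((4096, 4096))
activations = alloc_striped_cxl((4, 8192, 4096), dtype=torch.bfloat16)
```

The page-granularity interleaving used here mirrors the trade-off the abstract describes: striping spreads pages across AICs and aggregates their bandwidth, but every individual access still pays CXL latency, which is why the latency-critical optimizer state is pinned to local DRAM instead.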