The rapid growth of large language models (LLMs) has outpaced the evolution of single-GPU hardware, making model scale increasingly constrained by memory capacity rather than computation. While modern training systems extend GPU memory through distributed parallelism and offloading across CPU and storage tiers, they fundamentally retain a GPU-centric execution paradigm in which GPUs host persistent model replicas and full autograd graphs. As a result, scaling large models remains tightly coupled to multi-GPU clusters, complex distributed runtimes, and unpredictable host memory consumption, creating substantial barriers for node-scale post-training workloads such as instruction tuning, alignment, and domain adaptation. We present Horizon-LM, a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization. Horizon-LM treats host memory as the authoritative parameter store and uses GPUs solely as transient compute engines through a CPU-master, GPU-template execution model. By eliminating persistent GPU-resident modules and autograd graphs, employing explicit recomputation with manual gradient propagation, and introducing a pipelined double-buffered execution engine, Horizon-LM decouples model scale from GPU count and bounds memory usage to the theoretical parameter footprint. On a single H200 GPU with 1.5\,TB host RAM, Horizon-LM reliably trains models up to 120B parameters. On a standard single A100 machine, Horizon-LM achieves up to 12.2$\times$ higher training throughput than DeepSpeed ZeRO-3 with CPU offloading while preserving numerical correctness. Across platforms and scales, Horizon-LM sustains high device utilization and predictable memory growth, demonstrating that host memory, not GPU memory, defines the true feasibility boundary for node-scale large-model training.
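The pipelined double-buffered idea described above, overlapping the upload of the next layer's parameters with computation on the current layer, can be illustrated with a minimal sketch. This is not the Horizon-LM implementation: it is plain Python, a thread pool stands in for a CUDA transfer stream, and `stream_forward`, `upload`, and `compute` are hypothetical names.

```python
# Hypothetical sketch of double-buffered layer streaming: host memory holds
# all layer parameters, while the "device" holds at most two layers at once
# (one being computed on, one being prefetched). A worker thread models the
# asynchronous host-to-device copy that a real system would issue on a
# separate CUDA stream.
from concurrent.futures import ThreadPoolExecutor

def stream_forward(layers, x, upload, compute):
    """Run layers sequentially, overlapping upload of layer i+1 with compute of layer i."""
    pool = ThreadPoolExecutor(max_workers=1)
    buf = upload(layers[0])                      # prime the buffer with the first layer
    for i in range(len(layers)):
        nxt = (pool.submit(upload, layers[i + 1])  # prefetch next layer in background
               if i + 1 < len(layers) else None)
        x = compute(buf, x)                      # "GPU" work on the current layer
        if nxt is not None:
            buf = nxt.result()                   # swap: prefetched layer becomes current
    pool.shutdown()
    return x

# Toy usage: layers are scale factors, upload is the identity, compute multiplies.
out = stream_forward([2, 3, 5], 1, upload=lambda w: w, compute=lambda w, x: w * x)
```

In a real system the swap would be two pinned device buffers and the `result()` call a stream synchronization, so transfer latency is hidden whenever per-layer compute time exceeds per-layer upload time.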