Horizon-LM: A RAM-Centric Architecture for LLM Training

Withdrawal note (from arXiv): This paper contained an error in the throughput computation used in the experimental evaluation. Specifically, the TFLOPS calculation omitted the 12HL term in the training FLOPs formula, which led to systematic underestimation of the reported throughput numbers. We are withdrawing this version to correct the evaluation and avoid confusion for readers.

The rapid growth of large language models (LLMs) has outpaced the evolution of single-GPU hardware, making model scale increasingly constrained by memory capacity rather than computation. While modern training systems extend GPU memory through distributed parallelism and offloading across CPU and storage tiers, they fundamentally retain a GPU-centric execution paradigm in which GPUs host persistent model replicas and full autograd graphs. As a result, scaling large models remains tightly coupled to multi-GPU clusters, complex distributed runtimes, and unpredictable host memory consumption, creating substantial barriers for node-scale post-training workloads such as instruction tuning, alignment, and domain adaptation. We present Horizon-LM, a memory-centric training system that redefines the roles of CPU and GPU for large-model optimization. Horizon-LM treats host memory as the authoritative parameter store and uses GPUs solely as transient compute engines through a CPU-master, GPU-template execution model. By eliminating persistent GPU-resident modules and autograd graphs, employing explicit recomputation with manual gradient propagation, and introducing a pipelined double-buffered execution engine, Horizon-LM decouples model scale from GPU count and bounds memory usage to the theoretical parameter footprint. On a single H200 GPU with 1.5 TB host RAM, Horizon-LM reliably trains models up to 120B parameters. On a standard single A100 machine, Horizon-LM achieves up to 12.2× higher training throughput than DeepSpeed ZeRO-3 with CPU offloading while preserving numerical correctness. Across platforms and scales, Horizon-LM sustains high device utilization and predictable memory growth, demonstrating that host memory, not GPU memory, defines the true feasibility boundary for node-scale large-model training.
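
The abstract mentions a pipelined double-buffered execution engine in which host memory holds the parameters and the GPU only sees one layer at a time. The following is a minimal sketch of that general idea, not the paper's implementation: it uses plain Python threads in place of real host-to-device copies, and the names `load_layer`, `apply_layer`, and `forward_double_buffered` are hypothetical. While layer i is being computed, layer i+1 is fetched in the background, so at most two layer buffers are live at once.

```python
from concurrent.futures import ThreadPoolExecutor

def load_layer(host_params, i):
    """Stand-in for a host-to-device copy: returns a private copy of layer i."""
    return list(host_params[i])

def apply_layer(weights, x):
    """Stand-in for GPU compute: a toy elementwise affine map."""
    scale, shift = weights
    return [scale * v + shift for v in x]

def forward_double_buffered(host_params, x):
    """Run layers in order while prefetching the next layer in the background."""
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(load_layer, host_params, 0)  # fill the first buffer
        for i in range(len(host_params)):
            weights = pending.result()        # wait until layer i has arrived
            if i + 1 < len(host_params):      # start filling the other buffer
                pending = copier.submit(load_layer, host_params, i + 1)
            x = apply_layer(weights, x)       # compute overlaps the next copy
    return x

# Toy "model": three layers, each a (scale, shift) pair kept in host memory.
host_params = [(2.0, 1.0), (0.5, 0.0), (1.0, -1.0)]
result = forward_double_buffered(host_params, [1.0, 2.0])
```

In a real system the copy would presumably be an asynchronous transfer from pinned host memory on a separate stream, and the backward pass would stream layers in reverse order with recomputation; both are omitted here to keep the sketch self-contained.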

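
To make the claim that host memory sets the feasibility boundary more concrete, here is a back-of-the-envelope calculation. The per-parameter byte count is an illustrative assumption (BF16 weights and gradients plus two FP32 Adam moments), not the accounting used in the paper, and `host_footprint_tb` is a hypothetical helper.

```python
def host_footprint_tb(num_params, bytes_per_param):
    """Host RAM needed for parameter state, in decimal terabytes (1 TB = 1e12 bytes)."""
    return num_params * bytes_per_param / 1e12

# Assumed layout: BF16 weights (2) + BF16 gradients (2) + FP32 Adam moments (4 + 4).
BYTES_PER_PARAM = 2 + 2 + 4 + 4

footprint = host_footprint_tb(120e9, BYTES_PER_PARAM)  # 120B-parameter model
fits = footprint <= 1.5                                # 1.5 TB of host RAM
```

Under this assumed layout a 120B-parameter model needs about 1.44 TB, which fits within 1.5 TB of host RAM. Adding an FP32 master copy of the weights (16 bytes per parameter) would raise the total to about 1.92 TB and no longer fit, which shows how sensitive the boundary is to the chosen state layout.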