Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training

Second-order methods offer an attractive path toward more sample-efficient LLM training, but their practical use is often blocked by the systems cost of maintaining and updating large matrix-based optimizer states. We introduce \textbf{Asteria}, a runtime system designed to remove this bottleneck by separating second-order optimization logic from the critical GPU training path. Rather than keeping all preconditioner state on the accelerator, Asteria dynamically distributes optimizer state across GPU memory, CPU memory, and optional NVMe storage according to architectural constraints and runtime pressure. It further uses training hooks to prepare shadow states in advance, so that expensive inverse-root computations proceed asynchronously on the host while GPU computation continues. For distributed training, Asteria employs a bounded-staleness protocol that limits synchronization frequency while preserving optimizer effectiveness through topology-aware coordination. We evaluate Asteria in both memory-constrained and distributed training settings. On a DGX Spark platform with a single GB10 GPU and 128 GB of unified memory, Asteria supports second-order training of a 1B-parameter language model. On multi-node GH200 systems, it lowers visible optimizer overhead, reduces recurring latency spikes, accelerates convergence in wall-clock time, and maintains the optimization advantages of SOAP and KL-Shampoo when training a 7B-parameter language model. Our results suggest that second-order LLM training can be made practical not by simplifying the optimizer alone, but by rethinking how optimizer state, background computation, and distributed synchronization are managed at the runtime level.
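The shadow-state idea can be illustrated with a minimal single-process sketch. It is not Asteria's implementation: the names `ShadowPreconditioner`, `inverse_root`, `max_staleness`, and `run_demo` are hypothetical, the preconditioner is reduced to a diagonal so that the "inverse root" is an elementwise power rather than a matrix inverse root, and a thread pool stands in for the host-side worker. The sketch assumes the training loop keeps using the last completed preconditioner and blocks only when that preconditioner would become older than a fixed number of steps.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Optional


def inverse_root(diag: List[float], power: int = 4, eps: float = 1e-12) -> List[float]:
    """Stand-in for the expensive host-side step: elementwise d^(-1/power).

    A Shampoo-style optimizer would compute a matrix inverse root here.
    """
    return [(d + eps) ** (-1.0 / power) for d in diag]


class ShadowPreconditioner:
    """Active preconditioner plus a shadow copy refreshed in the background."""

    def __init__(self, dim: int, max_staleness: int, executor: ThreadPoolExecutor):
        self.stats = [0.0] * dim      # accumulated squared gradients
        self.active = [1.0] * dim     # preconditioner applied to updates
        self.active_step = 0          # step whose statistics produced `active`
        self.max_staleness = max_staleness
        self._executor = executor
        self._pending: Optional[Future] = None
        self._pending_step = 0

    def _swap_in(self) -> None:
        # result() returns immediately if the job is done, otherwise it waits.
        self.active = self._pending.result()
        self.active_step = self._pending_step
        self._pending = None

    def update(self, t: int, grad: List[float]) -> List[float]:
        """Accumulate statistics at step t and return the preconditioned gradient."""
        for i, g in enumerate(grad):
            self.stats[i] += g * g

        # Adopt a finished shadow state without waiting.
        if self._pending is not None and self._pending.done():
            self._swap_in()

        # Start a background refresh from a snapshot of the statistics.
        if self._pending is None:
            snapshot = list(self.stats)
            self._pending = self._executor.submit(inverse_root, snapshot)
            self._pending_step = t

        # Enforce the bound: wait for the refresh only when `active` is too old.
        if t - self.active_step > self.max_staleness:
            self._swap_in()

        return [p * g for p, g in zip(self.active, grad)]


def run_demo(num_steps: int = 20, max_staleness: int = 3) -> int:
    """Drive the sketch with synthetic gradients; return the worst staleness seen."""
    worst = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pre = ShadowPreconditioner(dim=4, max_staleness=max_staleness, executor=pool)
        for t in range(1, num_steps + 1):
            grad = [0.1 * (i + 1) * t for i in range(4)]
            pre.update(t, grad)
            worst = max(worst, t - pre.active_step)
    return worst


if __name__ == "__main__":
    print("worst staleness observed:", run_demo())
```

In this sketch, setting `max_staleness` to 0 makes every step wait for its own refresh, which corresponds to a fully synchronous preconditioner update; larger values let more of the refresh cost overlap with the training loop at the price of an older preconditioner.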
