Production heterogeneous supercomputing platforms are increasingly used to host large language model (LLM) training workloads. However, existing GPU-oriented training runtimes typically rely on high-bandwidth device memory, fast interconnects, and mature collective communication libraries, which makes them difficult to adapt directly to MT-3000, a platform with an explicit memory hierarchy, limited usable DDR capacity, and constrained inter-cluster communication. This paper presents RATrain, a resource-aware training runtime for dense LLMs on bandwidth-constrained heterogeneous supercomputing platforms. RATrain formulates standard non-interleaved 1F1B training as a training-state lifecycle scheduling problem and schedules gradient synchronization, parameter update, parameter-view prefetching, and activation recovery at layer-level and stage-local granularity. RATrain further combines an MT-3000-aware execution backend, which provides efficient and predictable FP16 GEMM, Attention Backward, and explicit data movement, with a resource-aware planner that selects feasible training configurations under the 20 GB usable-DDR constraint per compute cluster. We implement RATrain on a real MT-3000 platform and evaluate it using LLaMA-2-7B, Baichuan2-13B, Qwen2.5-32B, and LLaMA-2-70B configurations. Results show that RATrain achieves up to 1.35$\times$ end-to-end speedup over MT-3000-adapted GPU-style training strategies. For LLaMA-2-7B, RATrain scales to 1024 compute clusters, reaches 112,790.55 tokens/s, and achieves 97.0\% scaling efficiency. A further 1.028B-token correctness run shows that RATrain preserves the loss trajectory of a semantically equivalent Baseline-1F1B run, with a maximum relative loss deviation of 0.081\%.
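For readers unfamiliar with the schedule that RATrain builds on, the following is a minimal sketch of the standard non-interleaved 1F1B operation order for a single pipeline stage. It is not RATrain's implementation: the function names `one_f_one_b_schedule` and `peak_in_flight` are hypothetical, and the sketch models only the order of forward (`F`) and backward (`B`) micro-batch steps, not gradient synchronization, parameter update, prefetching, or activation recovery.

```python
from typing import List, Tuple

Op = Tuple[str, int]  # ("F" or "B", micro-batch index)


def one_f_one_b_schedule(num_stages: int, num_microbatches: int, stage: int) -> List[Op]:
    """Ordered ops for one stage under textbook non-interleaved 1F1B (illustrative only)."""
    if not 0 <= stage < num_stages:
        raise ValueError("stage must be in [0, num_stages)")
    # Earlier stages run more warm-up forwards before their first backward.
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup
    ops: List[Op] = []
    fwd = bwd = 0
    for _ in range(warmup):        # warm-up phase: forwards only
        ops.append(("F", fwd))
        fwd += 1
    for _ in range(steady):        # steady phase: one forward, then one backward
        ops.append(("F", fwd))
        fwd += 1
        ops.append(("B", bwd))
        bwd += 1
    for _ in range(warmup):        # cool-down phase: drain remaining backwards
        ops.append(("B", bwd))
        bwd += 1
    return ops


def peak_in_flight(ops: List[Op]) -> int:
    """Largest number of micro-batches whose activations are live at once."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    for s in range(4):
        sched = one_f_one_b_schedule(num_stages=4, num_microbatches=8, stage=s)
        print(s, peak_in_flight(sched), sched)
```

In this sketch the first stage holds activations for the most micro-batches at once and the last stage for only one, which is one reason per-stage memory planning matters under a fixed DDR budget.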