Training Large Language Models (LLMs) is extremely memory-intensive. To address this problem, existing work such as ZeRO-Offload combines CPU and GPU resources during training. Such techniques have largely democratized billion-scale model training, making it possible to train with only a few consumer-grade graphics cards. However, we observe that existing frameworks often provide only coarse-grained memory management and require experienced experts for configuration tuning, leading to suboptimal hardware utilization and performance. This paper proposes ProTrain, a novel training system that intelligently balances memory usage and performance by coordinating memory, computation, and IO. ProTrain achieves adaptive memory management through Chunk-Based Model State Management and Block-Wise Activation Management, guided by a Memory-Aware Runtime Profiler, without user intervention. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43$\times$ to 2.71$\times$ compared to state-of-the-art training systems.