Memory pressure has emerged as a dominant constraint in scaling the training of large language models (LLMs), particularly in resource-constrained environments. While modern frameworks incorporate various memory-saving techniques, they often expose low-level configuration knobs that require manual tuning and specialized systems expertise. This adds engineering overhead and risks suboptimal hardware utilization when the knobs are misconfigured. This paper introduces ProTrain, a novel training system that automatically tailors memory management policies to the model architecture and the underlying hardware resources, eliminating the need for manual intervention. At its core, ProTrain abstracts complex memory management strategies into a few tunable configuration parameters and uses cost models to search for optimal parameter settings. A runtime profiler provides precise estimates of latency, memory usage, and I/O bandwidth, from which ProTrain builds high-fidelity cost models. Because ProTrain does not change the training algorithm, it does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43$\times$ to 2.71$\times$ compared to state-of-the-art training systems.
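To make the idea of cost-model-driven configuration search more concrete, the following is a minimal sketch, not ProTrain's actual implementation. It assumes two hypothetical knobs (`n_resident`, the number of blocks whose model states stay on the GPU, and `n_checkpoint`, the number of blocks whose activations are recomputed), a hypothetical `Profile` of measured quantities, and deliberately simple linear cost models with no overlap between transfer and compute. The abstract does not specify the real parameters or cost models, so every name and number below is an illustrative assumption.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Profile:
    """Hypothetical profiled quantities; all fields are illustrative assumptions."""
    n_blocks: int               # number of transformer blocks
    fwd_ms_per_block: float     # forward time per block
    bwd_ms_per_block: float     # backward time per block
    act_gb_per_block: float     # activation memory per block when kept on GPU
    state_gb_per_block: float   # model-state memory per block
    transfer_ms_per_gb: float   # host <-> GPU transfer cost
    gpu_budget_gb: float        # available GPU memory


@dataclass(frozen=True)
class Config:
    """Hypothetical tunable parameters (not taken from the paper)."""
    n_resident: int    # blocks whose model states stay on the GPU
    n_checkpoint: int  # blocks whose activations are recomputed


def predict_memory_gb(p: Profile, c: Config) -> float:
    """Assumed memory model: resident states plus non-checkpointed activations."""
    states = c.n_resident * p.state_gb_per_block
    activations = (p.n_blocks - c.n_checkpoint) * p.act_gb_per_block
    return states + activations


def predict_time_ms(p: Profile, c: Config) -> float:
    """Assumed time model: compute + recomputation + non-overlapped transfers."""
    compute = p.n_blocks * (p.fwd_ms_per_block + p.bwd_ms_per_block)
    recompute = c.n_checkpoint * p.fwd_ms_per_block
    offloaded_gb = (p.n_blocks - c.n_resident) * p.state_gb_per_block
    transfer = 2 * offloaded_gb * p.transfer_ms_per_gb  # fetched for fwd and bwd
    return compute + recompute + transfer


def search(p: Profile) -> Optional[Config]:
    """Exhaustively pick the fastest predicted config that fits the memory budget."""
    best, best_time = None, float("inf")
    for n_res, n_ckpt in product(range(p.n_blocks + 1), repeat=2):
        cand = Config(n_res, n_ckpt)
        if predict_memory_gb(p, cand) > p.gpu_budget_gb:
            continue  # infeasible under the assumed memory model
        t = predict_time_ms(p, cand)
        if t < best_time:
            best, best_time = cand, t
    return best  # None if no configuration fits


if __name__ == "__main__":
    example = Profile(n_blocks=4, fwd_ms_per_block=10.0, bwd_ms_per_block=20.0,
                      act_gb_per_block=2.0, state_gb_per_block=3.0,
                      transfer_ms_per_gb=5.0, gpu_budget_gb=12.0)
    print(search(example))
```

Because the search space here is only two small integers, exhaustive enumeration is enough; a real system with more parameters or finer-grained cost models would likely need pruning or a smarter search strategy.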