Large Language Models (LLMs) have demonstrated impressive performance across several transformative tasks. However, efficiently utilizing large-scale cluster resources to develop LLMs is non-trivial, and the process is often riddled with challenges such as frequent hardware failures, intricate parallelization strategies, and imbalanced resource utilization. In this paper, we present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme. Specifically, we investigate discrepancies between LLMs and prior task-specific Deep Learning (DL) workloads, explore resource utilization patterns, and identify the impact of various job failures. Our analysis summarizes the hurdles we encountered and uncovers potential opportunities to optimize systems tailored for LLMs. Furthermore, we introduce our system efforts: (1) fault-tolerant pretraining, which enhances fault tolerance through LLM-involved failure diagnosis and automatic recovery; and (2) decoupled scheduling for evaluation, which achieves timely performance feedback via trial decomposition and scheduling optimization.