Large Language Models (LLMs) have demonstrated impressive performance across a range of transformative tasks. However, efficiently utilizing large-scale cluster resources to develop LLMs is non-trivial: the process is often riddled with challenges such as frequent hardware failures, intricate parallelization strategies, and imbalanced resource utilization. In this paper, we present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme. Specifically, we investigate discrepancies between LLMs and prior task-specific Deep Learning (DL) workloads, explore resource utilization patterns, and identify the impact of various job failures. Our analysis summarizes the hurdles we encountered and uncovers potential opportunities to optimize systems tailored for LLMs. Furthermore, we introduce two of our system efforts: (1) fault-tolerant pretraining, which enhances fault tolerance through LLM-involved failure diagnosis and automatic recovery; and (2) decoupled scheduling for evaluation, which achieves timely performance feedback via trial decomposition and scheduling optimization.
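The fault-tolerant pretraining effort mentioned above combines failure diagnosis with automatic recovery from checkpoints. The sketch below is a minimal, hypothetical illustration of that pattern, not the paper's actual implementation: a toy training loop checkpoints periodically, and a supervisor catches a (simulated) hardware failure, records a diagnosis, and resumes from the most recent checkpoint. All function names and the JSON checkpoint format are assumptions made for this example.

```python
import json
import os

def save_checkpoint(path, step, state):
    # Write to a temp file and rename atomically, so a crash mid-write
    # can never corrupt the latest usable checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    # Fresh run if no checkpoint exists yet.
    if not os.path.exists(path):
        return 0, 0
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def train(total_steps, ckpt_path, ckpt_every=10, fail_at=None):
    """Toy training loop; `fail_at` injects one simulated hardware failure."""
    step, state = load_checkpoint(ckpt_path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated hardware failure")
        state += 1  # stand-in for one optimizer step
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(ckpt_path, step, state)
    return step, state

def run_with_recovery(total_steps, ckpt_path, fail_at):
    """Supervisor: diagnose a failure, then resume from the last checkpoint."""
    try:
        return train(total_steps, ckpt_path, fail_at=fail_at)
    except RuntimeError as e:
        # In the paper's setting, an LLM helps classify failure logs;
        # here we simply record the error and restart from the checkpoint.
        print(f"diagnosed failure: {e}; resuming from last checkpoint")
        return train(total_steps, ckpt_path, fail_at=None)
```

For example, `run_with_recovery(50, path, fail_at=25)` loses only the steps since the checkpoint at step 20, then resumes and completes all 50 steps. The atomic-rename checkpointing is the key design choice: recovery is only automatic if the checkpoint it restarts from is guaranteed to be intact.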