Modern Deep Learning (DL) models have grown to sizes that require massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization, and thereby amortize their steep cost, is a challenging task that requires a careful balance of compute, memory, and network resources. Moreover, each model has a plethora of tuning knobs that drastically affect performance, and their optimal values often depend on the underlying cluster's characteristics, which necessitates a complex cluster-workload co-design process. To facilitate the design space exploration of such massive DL training clusters, we introduce COMET, a holistic cluster design methodology and workflow for jointly studying the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training. We develop a step-by-step process to establish a reusable and flexible methodology, and demonstrate its application with case studies of training large models on cluster configurations with varying compute, memory, and network resources. Our case studies demonstrate COMET's utility in identifying promising architectural optimization directions and in guiding system designers in configuring key model and cluster parameters. To illustrate, cluster configuration comparisons identify performance differences of up to 7.7x and highlight performance optimization opportunities of up to 1.4x when employing memory expansion as an optimization technique.