Orchestrating deep neural network (DNN) model inference on GPU clusters presents two significant challenges: achieving high accelerator efficiency, which depends on the batching properties of model inference, while meeting latency service level objectives (SLOs); and adapting to workload changes, both short-term fluctuations and long-term shifts in resource allocation. To address these challenges, we propose Symphony, a centralized scheduling system that can scale to millions of requests per second and coordinate tens of thousands of GPUs. Symphony uses a non-work-conserving scheduling algorithm that achieves high batch efficiency while also enabling robust autoscaling. Additionally, we develop an epoch-scale algorithm that allocates models to sub-clusters based on their compute and memory needs. Through extensive experiments, we demonstrate that Symphony achieves up to 4.7x higher goodput than prior systems.
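To illustrate why a non-work-conserving scheduler can batch more efficiently than a work-conserving one, the following is a minimal sketch and not Symphony's actual algorithm. It assumes a single GPU, a linear batch latency model (`alpha_ms * batch_size + beta_ms`), and a known, sorted arrival trace; the names `LatencyModel`, `work_conserving_batches`, and `deferred_batches` are hypothetical. The deferred variant leaves the GPU idle until the latest start time that still meets the oldest request's deadline, so that later arrivals can join the same batch.

```python
from dataclasses import dataclass


@dataclass
class LatencyModel:
    """Assumed linear model: latency(b) = alpha_ms * b + beta_ms."""
    alpha_ms: float  # marginal cost per request in a batch
    beta_ms: float   # fixed overhead per batch

    def latency(self, batch_size: int) -> float:
        return self.alpha_ms * batch_size + self.beta_ms


def work_conserving_batches(arrivals_ms, model, max_batch=64):
    """Dispatch whatever is queued as soon as the GPU is free."""
    batches = []  # list of (start_ms, batch_size)
    gpu_free_at = 0.0
    i, n = 0, len(arrivals_ms)
    while i < n:
        start = max(gpu_free_at, arrivals_ms[i])
        size = 0
        # Take every request that has already arrived by the start time.
        while i + size < n and size < max_batch and arrivals_ms[i + size] <= start:
            size += 1
        batches.append((start, size))
        gpu_free_at = start + model.latency(size)
        i += size
    return batches


def deferred_batches(arrivals_ms, slo_ms, model, max_batch=64):
    """Non-work-conserving: delay dispatch to the latest start time that
    still lets the oldest request in the batch meet its deadline.

    Simplification: if the GPU is still busy at that time, the batch starts
    late and the head request may miss its SLO; a real system would need to
    handle that case (for example by dropping or rerouting).
    """
    batches = []
    gpu_free_at = 0.0
    i, n = 0, len(arrivals_ms)
    while i < n:
        head_deadline = arrivals_ms[i] + slo_ms
        size = 1
        while i + size < n and size < max_batch:
            # Latest start if the batch grew by one more request.
            latest_start = head_deadline - model.latency(size + 1)
            if arrivals_ms[i + size] > latest_start:
                break  # the next request would arrive too late to join
            size += 1
        start = max(gpu_free_at, head_deadline - model.latency(size))
        batches.append((start, size))
        gpu_free_at = start + model.latency(size)
        i += size
    return batches
```

With the illustrative trace `[0, 2, 4, 6, 8, 10, 12, 14]` (milliseconds), `LatencyModel(1.0, 5.0)`, and a 30 ms SLO, the work-conserving variant forms three batches of sizes 1, 3, and 4 (23 ms of GPU time), while the deferred variant forms a single batch of 8 starting at 17 ms (13 ms of GPU time) that finishes exactly at the first request's deadline. The fixed per-batch overhead is what makes the larger batch cheaper in this toy model.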