BOA Constrictor: Squeezing Performance out of GPUs in the Cloud via Budget-Optimal Allocation

The past decade has seen a dramatic increase in demand for GPUs to train Machine Learning (ML) models. Because it is prohibitively expensive for most organizations to build and maintain a large GPU cluster, they instead rent GPUs from cloud providers. The customer is responsible for devising a policy for (i) deciding how many GPUs to rent at every moment in time to process a stream of ML training jobs and (ii) allocating the rented GPUs among the currently active jobs in the system. Because ML training jobs can be parallelized across different numbers of GPUs, the customer generally has many options for how many GPUs to use for each job. Allocating more GPUs to a single training job causes the job to complete more quickly. However, the customer pays for each GPU-hour used, and a training job receives a diminishing marginal benefit from each additional GPU. Hence, allocating too many GPUs to a single training job can dramatically increase the overall cost that the customer pays to the cloud provider. This gives rise to a cost-performance tradeoff that customers must balance when running training jobs in the cloud. To balance this tradeoff, we develop BOA Constrictor, a new scheduler for ML training jobs that uses a Budget-Optimal Allocation (BOA) policy to squeeze the highest level of performance out of a cloud-deployed GPU cluster given a fixed budget constraint. We explicitly formulate the problem as a budget-constrained scheduling problem and derive the BOA policy, which minimizes the average job completion time (JCT) of a stream of arriving jobs subject to the user's budget. For a given budget level, we demonstrate that BOA Constrictor can reduce average JCT by a factor of 1.6 in small-scale implementation experiments and by a factor of 2 in detailed, large-scale simulations compared to state-of-the-art heuristic-based schedulers.
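The cost-performance tradeoff described above can be made concrete with a small numerical sketch. It is not the BOA policy and does not come from the paper: the power-law speedup function, the exponent `p`, and all function names are assumptions chosen purely for illustration. Under a sublinear speedup, adding GPUs shortens a job's running time but raises the GPU-hours billed for it.

```python
# Illustrative sketch only: the speedup model and every name here are
# hypothetical, not taken from BOA Constrictor.

def speedup(k: int, p: float = 0.5) -> float:
    """Assumed sublinear speedup of a job on k GPUs (diminishing returns for 0 < p < 1)."""
    return k ** p


def run_time_hours(work: float, k: int, p: float = 0.5) -> float:
    """Hours to finish a job that needs `work` hours on a single GPU."""
    return work / speedup(k, p)


def cost_gpu_hours(work: float, k: int, p: float = 0.5) -> float:
    """GPU-hours billed: k GPUs held for the whole running time."""
    return k * run_time_hours(work, k, p)


def tradeoff_table(work: float, gpu_counts: list[int], p: float = 0.5):
    """Return (gpus, running time, billed GPU-hours) for each candidate allocation."""
    return [
        (k, run_time_hours(work, k, p), cost_gpu_hours(work, k, p))
        for k in gpu_counts
    ]


if __name__ == "__main__":
    for k, hours, cost in tradeoff_table(16.0, [1, 4, 16]):
        print(f"{k:>2} GPUs: {hours:5.1f} h, {cost:5.1f} GPU-hours")
```

With these assumed numbers, moving from 1 to 16 GPUs cuts the running time from 16 hours to 4 hours while the bill grows from 16 to 64 GPU-hours. A budget-constrained scheduler has to decide where on this curve each job should sit.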
