How to Rent GPUs on a Budget

The explosion in Machine Learning (ML) over the past ten years has led to a dramatic increase in demand for GPUs to train ML models. Because it is prohibitively expensive for most users to build and maintain a large GPU cluster, large cloud providers (Microsoft Azure, Amazon AWS, Google Cloud) have seen explosive growth in demand for renting cloud-based GPUs. In this cloud-computing paradigm, a user must specify their demand for GPUs at every moment in time, and will pay for every GPU-hour they use. ML training jobs are known to be parallelizable to different degrees. Given a stream of ML training jobs, a user typically wants to minimize the mean response time across all jobs. Here, the response time of a job denotes the time from when a job arrives until it is complete. Additionally, the user is constrained by some operating budget. Specifically, in this paper the user is constrained to use no more than $b$ GPUs per hour, over a long-run time average. The question is how to minimize mean response time while meeting the budget constraint. Because training jobs receive a diminishing marginal benefit from running on additional GPUs, allocating too many GPUs to a single training job can dramatically increase the overall cost paid by the user. Hence, an optimal rental policy must balance a tradeoff between training cost and mean response time. This paper derives the optimal rental policy for a stream of training jobs where the jobs have different levels of parallelizability (specified by a speedup function) and different job sizes (amounts of inherent work). We make almost no assumptions about the arrival process and about the job size distribution. Our optimal policy specifies how many GPUs to rent at every moment in time and how to allocate these GPUs.
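
The cost side of this tradeoff can be illustrated with a minimal sketch. It assumes a hypothetical power-law speedup function $s(k) = k^p$ with $0 < p < 1$, which is not taken from the paper, and it considers only the running time of a single job in isolation, ignoring queueing. The function names and the example job size are likewise illustrative.

```python
# Illustrative only: the power-law speedup s(k) = k**p is an assumed form,
# not the speedup function used in the paper.

def speedup(k: float, p: float = 0.5) -> float:
    """Hypothetical sublinear speedup on k GPUs (0 < p < 1)."""
    return k ** p


def run_time(size: float, k: float, p: float = 0.5) -> float:
    """Hours needed to finish a job with `size` units of inherent work on k GPUs."""
    return size / speedup(k, p)


def gpu_hours(size: float, k: float, p: float = 0.5) -> float:
    """GPU-hours billed for the job: k GPUs held for the whole running time."""
    return k * run_time(size, k, p)


if __name__ == "__main__":
    size = 16.0  # hypothetical job size
    for k in (1, 4, 16):
        print(f"k={k:2d}  run time={run_time(size, k):5.1f} h  "
              f"cost={gpu_hours(size, k):5.1f} GPU-hours")
```

Under this assumed speedup, going from 1 to 16 GPUs cuts the running time by a factor of 4 but raises the GPU-hours billed by a factor of 4, which is the kind of cost growth a budget of $b$ GPUs per hour has to absorb.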