In modern machine learning (ML) applications on cloud or edge computing platforms, batching is an important technique for providing efficient and economical services at scale. In particular, parallel computing resources on these platforms, such as graphics processing units (GPUs), achieve higher computational and energy efficiency with larger batch sizes. However, larger batch sizes may also lead to longer response times, so the batching policy must be designed judiciously. This paper aims to provide a dynamic batching policy that strikes a balance between efficiency and latency. The GPU-based inference service is modeled as a batch service queue with batch-size-dependent processing time. The design of dynamic batching is then a continuous-time average-cost problem, formulated as a semi-Markov decision process (SMDP) with the objective of minimizing the weighted sum of average response time and average power consumption. The optimal policy is obtained by solving an associated discrete-time Markov decision process (MDP) problem with finite state approximation and "discretization". By introducing an abstract cost to reflect the impact of "tail" states, the space complexity and the time complexity of the procedure are reduced by 63.5% and 98%, respectively. Our results show that the optimal policies may have a control-limit structure. Numerical results also show that SMDP-based batching policies adapt to different traffic intensities and outperform other benchmark policies. Furthermore, the proposed solution offers notable flexibility in balancing power consumption and latency.
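As a rough illustration of the pipeline described above (SMDP formulation, finite state approximation, "discretization", then solving a discrete-time MDP), the following is a minimal sketch rather than the paper's implementation. Everything concrete in it is an assumption made for illustration: Poisson arrivals with rate `lam`, deterministic and linear service-time and energy profiles `tau(b)` and `zeta(b)`, the function name `solve_batching_smdp`, and all numeric values. The discretization step uses the standard data transformation for average-cost SMDPs, and the resulting MDP is solved by relative value iteration. The tail is handled by simply lumping overflow into the last state; the abstract cost for "tail" states mentioned above is not modeled.

```python
import math
import numpy as np


def solve_batching_smdp(lam=2.0, b_max=8, s_max=40, w_latency=1.0,
                        w_power=1.0, tol=1e-8, max_iter=100_000):
    """Return (average-cost estimate, policy); policy[s] is the batch size
    chosen with s waiting requests, where 0 means "wait for the next arrival"."""
    tau = lambda b: 0.05 + 0.02 * b    # ASSUMED service time of a batch of size b
    zeta = lambda b: 0.5 + 0.1 * b     # ASSUMED energy used by a batch of size b

    n = s_max + 1
    owner, action, sojourn, cost, rows = [], [], [], [], []
    for s in range(n):
        if s < s_max:                  # action 0: wait (disallowed in the last state)
            p = np.zeros(n)
            p[s + 1] = 1.0
            y = 1.0 / lam              # mean time until the next arrival
            owner.append(s); action.append(0); sojourn.append(y)
            cost.append(w_latency / lam * s * y)
            rows.append(p)
        for b in range(1, min(s, b_max) + 1):
            y, m = tau(b), lam * tau(b)
            p, pk = np.zeros(n), math.exp(-m)
            for k in range(s_max - s + b):      # k arrivals during service
                p[s - b + k] += pk
                pk *= m / (k + 1)
            p[s_max] += max(0.0, 1.0 - p.sum())  # lump overflow into last state
            # Expected request-time accumulated during service, scaled by 1/lam
            # so that its time average is the mean response time (Little's law).
            hold = s * y + lam * y * y / 2.0
            owner.append(s); action.append(b); sojourn.append(y)
            cost.append(w_latency / lam * hold + w_power * zeta(b))
            rows.append(p)

    owner = np.array(owner)
    y, c, P = np.array(sojourn), np.array(cost), np.vstack(rows)

    # "Discretization": data transformation from the SMDP to an MDP.
    eta = 0.9 * y.min()
    c_t = c / y
    P_t = P * (eta / y)[:, None]
    P_t[np.arange(len(y)), owner] += 1.0 - eta / y

    # Relative value iteration on the transformed MDP.
    h, gain = np.zeros(n), 0.0
    for _ in range(max_iter):
        q = c_t + P_t @ h
        th = np.full(n, np.inf)
        np.minimum.at(th, owner, q)    # minimum over the actions of each state
        diff = th - h
        gain = 0.5 * (diff.max() + diff.min())
        h = th - th[0]
        if diff.max() - diff.min() < tol:
            break

    policy, best = np.zeros(n, dtype=int), np.full(n, np.inf)
    for i, s in enumerate(owner):
        if q[i] < best[s]:
            best[s], policy[s] = q[i], action[i]
    return gain, policy


if __name__ == "__main__":
    g, pi = solve_batching_smdp()
    print("estimated average cost:", g)
    print("batch size per queue length:", pi.tolist())
```

Printing the policy for different values of `lam`, `w_latency`, and `w_power` is one way to inspect, under these assumed profiles, whether the computed policy waits below some queue length and serves above it, which is the kind of control-limit structure the results refer to.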