In modern machine learning (ML) applications on cloud or edge computing platforms, batching is an important technique for providing efficient and economical services at scale. In particular, parallel computing resources on these platforms, such as graphics processing units (GPUs), achieve higher computational and energy efficiency with larger batch sizes. However, larger batch sizes may also lengthen response time, so batching requires judicious design. This paper aims to provide a dynamic batching policy that strikes a balance between efficiency and latency. The GPU-based inference service is modeled as a batch service queue with batch-size-dependent processing time. The design of dynamic batching is then a continuous-time average-cost problem, formulated as a semi-Markov decision process (SMDP) with the objective of minimizing the weighted sum of average response time and average power consumption. The optimal policy is obtained by solving an associated discrete-time Markov decision process (MDP) problem with finite state approximation and "discretization". By introducing an abstract cost to reflect the impact of "tail" states, the space complexity and the time complexity of the procedure can be reduced by 63.5% and 98%, respectively. Our results show that the optimal policies potentially possess a control limit structure. Numerical results also show that SMDP-based batching policies can adapt to different traffic intensities and outperform other benchmark policies. Furthermore, the proposed solution has notable flexibility in balancing power consumption and latency.
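To make the latency/power trade-off concrete, the following is a minimal simulation sketch of a batch service queue under a fixed control-limit policy (start a batch once the number of waiting requests reaches a threshold). It is not the SMDP solution procedure described in the abstract. The linear processing-time and energy models, all parameter values, and the names `service_time`, `batch_energy`, and `simulate` are assumptions introduced purely for illustration.

```python
import bisect
import random


def service_time(b, alpha=1.0, beta=0.2):
    """Assumed processing time (ms) of a batch of size b: alpha + beta * b."""
    return alpha + beta * b


def batch_energy(b, e_fixed=2.0, e_per_request=0.3):
    """Assumed energy (mJ) of a batch of size b: fixed cost plus per-request cost."""
    return e_fixed + e_per_request * b


def simulate(threshold, lam=0.5, b_max=8, n_requests=20000,
             w_latency=1.0, w_power=1.0, seed=0):
    """Simulate Poisson arrivals (rate lam per ms) served in FIFO batches.

    The server starts a batch as soon as it is free and at least `threshold`
    requests are waiting, and serves at most `b_max` requests per batch.
    Returns average response time, average power, and their weighted sum.
    """
    if not 1 <= threshold <= b_max:
        raise ValueError("threshold must lie in [1, b_max]")

    rng = random.Random(seed)
    arrivals, t = [], 0.0
    for _ in range(n_requests):
        t += rng.expovariate(lam)  # exponential inter-arrival times
        arrivals.append(t)

    i = 0                 # index of the oldest unserved request
    free_at = 0.0         # time at which the server next becomes free
    total_response = 0.0
    total_energy = 0.0
    while i < n_requests:
        # Arrival time of the request that brings the queue up to the threshold
        # (or of the last request, if fewer than `threshold` remain).
        trigger = arrivals[min(i + threshold, n_requests) - 1]
        start = max(free_at, trigger)
        # Number of requests that have arrived by `start` and are still waiting.
        waiting = bisect.bisect_right(arrivals, start, i, n_requests) - i
        b = min(waiting, b_max)
        finish = start + service_time(b)
        total_response += sum(finish - a for a in arrivals[i:i + b])
        total_energy += batch_energy(b)
        i += b
        free_at = finish

    latency = total_response / n_requests   # average response time (ms)
    power = total_energy / free_at          # average power (mJ/ms = W)
    return {"latency": latency, "power": power,
            "cost": w_latency * latency + w_power * power}


if __name__ == "__main__":
    for q in range(1, 9):
        r = simulate(threshold=q)
        print(q, round(r["latency"], 3), round(r["power"], 3), round(r["cost"], 3))
```

Sweeping the threshold under these assumed models shows the tension the abstract describes: a larger threshold amortizes the fixed per-batch energy over more requests but makes requests wait longer. A dynamic policy would choose the batch size from the current queue state rather than from one fixed threshold.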