Inference serving is critical for deploying machine learning models in real-world applications, ensuring efficient processing of and quick responses to inference requests. However, managing resources in these systems poses significant challenges, particularly in maintaining performance under varying and unpredictable workloads. The two primary scaling strategies, horizontal and vertical scaling, offer different advantages and limitations. Horizontal scaling adds instances to handle increased load but suffers from cold-start delays and added management complexity. Vertical scaling boosts the capacity of existing instances, enabling quicker responses, but is limited by hardware constraints and the model's parallelization capability. This paper introduces Themis, a system designed to combine the benefits of horizontal and vertical scaling in inference serving systems. Themis employs a two-stage autoscaling strategy: it first uses in-place vertical scaling to absorb workload surges, then switches to horizontal scaling to optimize resource efficiency once the workload stabilizes. The system profiles the processing latency of deep learning models, estimates queuing delays, and applies workload-dependent dynamic programming algorithms to solve the joint horizontal and vertical scaling problem optimally. Extensive evaluations on real-world workload traces demonstrate over $10\times$ reduction in SLO violations compared to state-of-the-art horizontal or vertical autoscaling approaches, while maintaining resource efficiency when the workload is stable.