When deploying machine learning (ML) applications, the automated allocation of computing resources, commonly referred to as autoscaling, is crucial for maintaining a consistent inference time under fluctuating workloads. The objective is to maximize Quality of Service metrics, with an emphasis on performance and availability, while minimizing resource costs. In this paper, we compare scalable deployment techniques across three levels of scaling: the application level (TorchServe, RayServe) and the container level (K3s) in a local environment (a production server), and the container and machine levels in a cloud environment (Amazon Web Services Elastic Container Service and Elastic Kubernetes Service). The comparison examines the mean and standard deviation of inference time in a multi-client scenario, together with upscaling response times. Based on this analysis, we propose a deployment strategy for both local and cloud-based environments.
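As a rough illustration of the evaluation metric named above, the following minimal sketch times requests issued by several concurrent clients and reports the mean and sample standard deviation of the observed inference times. It is not the benchmark used in the paper: `fake_inference`, `run_clients`, and `summarize` are hypothetical names, the sleep stands in for a real request to a model server, and threads are only one possible way to emulate clients.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def fake_inference(payload):
    """Hypothetical stand-in for a request to a model server (assumption)."""
    time.sleep(0.001)  # simulate a small, fixed amount of work
    return payload


def timed_call(infer, payload):
    """Return the wall-clock duration of a single inference call, in seconds."""
    start = time.perf_counter()
    infer(payload)
    return time.perf_counter() - start


def run_clients(infer, n_clients, requests_per_client):
    """Emulate n_clients concurrent clients, each sending requests sequentially."""

    def client(client_id):
        return [
            timed_call(infer, (client_id, i)) for i in range(requests_per_client)
        ]

    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        per_client = list(pool.map(client, range(n_clients)))
    # Flatten the per-client lists into one list of latencies.
    return [t for times in per_client for t in times]


def summarize(latencies):
    """Mean and sample standard deviation of a list of latencies."""
    return statistics.mean(latencies), statistics.stdev(latencies)


if __name__ == "__main__":
    latencies = run_clients(fake_inference, n_clients=4, requests_per_client=5)
    mean, std = summarize(latencies)
    print(f"mean={mean:.4f}s std={std:.4f}s over {len(latencies)} requests")
```

In such a setup, a low standard deviation as the number of clients grows would suggest that the deployment keeps inference time consistent under load, which is the property the comparison targets.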