When deploying machine learning (ML) applications, the automated allocation of computing resources, commonly referred to as autoscaling, is crucial for maintaining a consistent inference time under fluctuating workloads. The objective is to maximize Quality of Service (QoS) metrics, with an emphasis on performance and availability, while minimizing resource costs. In this paper, we compare scalable deployment techniques across three levels of scaling: the application level (TorchServe, RayServe) and the container level (K3s) in a local environment (a production server), and the container and machine levels in a cloud environment (Amazon Web Services Elastic Container Service and Elastic Kubernetes Service). We compare these techniques using the mean and standard deviation of inference time in a multi-client scenario, together with upscaling response times. Based on this analysis, we propose a deployment strategy for both local and cloud-based environments.
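To make the comparison metric concrete, the following is a minimal sketch of how the mean and standard deviation of inference time could be collected in a multi-client scenario. It is not the measurement harness used in this paper: `fake_inference` is a hypothetical stand-in for a call to a served model, and the names `run_client` and `measure`, as well as the client and request counts, are assumptions chosen for illustration.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def fake_inference(payload):
    """Hypothetical stand-in for a request to a deployed model endpoint."""
    time.sleep(0.001)  # simulated processing time (assumption)
    return payload


def run_client(infer, n_requests):
    """Send n_requests one after another; return per-request latencies in seconds."""
    latencies = []
    for i in range(n_requests):
        start = time.perf_counter()
        infer(i)
        latencies.append(time.perf_counter() - start)
    return latencies


def measure(infer, n_clients, n_requests):
    """Run n_clients concurrent clients and summarize the observed inference time."""
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        futures = [pool.submit(run_client, infer, n_requests) for _ in range(n_clients)]
        latencies = [t for f in futures for t in f.result()]
    return {
        "count": len(latencies),
        "mean_s": statistics.mean(latencies),
        "std_s": statistics.stdev(latencies),  # sample standard deviation
    }


summary = measure(fake_inference, n_clients=4, n_requests=5)
print(summary)
```

In a real experiment, `fake_inference` would be replaced by an HTTP or gRPC request to the deployment under test, and the number of clients would be varied to observe how the two statistics change as the system scales.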