Practice of Alibaba Cloud on Elastic Resource Provisioning for Large-scale Microservices Cluster

Cloud-native architecture is becoming increasingly crucial in today's cloud computing environments because application development demands speed and flexibility. It uses microservice technology to break traditional monolithic applications into lightweight, self-contained microservice components. However, as microservices grow in scale and develop dynamic inter-dependencies, they pose new resource provisioning challenges that traditional resource scheduling approaches cannot fully address. Microservices with different resource needs and latency requirements can form complex call chains, making it difficult to allocate resources to each component accurately and at fine granularity while maintaining the overall quality of service of the chain. In this work, we address the problem of how to provision resources efficiently for microservice platforms of growing scale while ensuring the performance of latency-critical microservices. To this end, we present in-depth analyses of Alibaba's microservice cluster and propose optimized resource provisioning algorithms that enhance resource utilization while meeting latency requirements. First, we analyze the features that distinguish microservices in Alibaba's cluster from traditional applications. Then we present Alibaba's resource capacity provisioning workflow and framework, which address the challenges of resource provisioning for large-scale, latency-critical microservice clusters. Finally, we propose resource provisioning algorithms that enhance Alibaba's current practice by making both proactive and reactive scheduling decisions based on different workload patterns; these can improve resource usage by 10-15% in Alibaba's clusters while maintaining the latency that microservices require.
