Kubernetes (k8s) has the potential to coordinate distributed edge resources and centralized cloud resources, but it currently lacks a scheduling framework specialized for edge-cloud networks. In addition, the hierarchical distribution of heterogeneous resources makes modeling and scheduling k8s-oriented edge-cloud networks particularly challenging. In this paper, we introduce KaiS, a learning-based scheduling framework for such edge-cloud networks that aims to improve the long-term throughput rate of request processing. First, we design a coordinated multi-agent actor-critic algorithm to handle decentralized request dispatch and dynamic dispatch spaces within the edge cluster. Second, to accommodate diverse system scales and structures, we use graph neural networks to embed system state information and combine the embedding results with multiple policy networks, reducing the orchestration dimensionality through stepwise scheduling. Finally, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration, and we present an implementation design that deploys these algorithms compatibly with native k8s components. Experiments using real workload traces show that KaiS can successfully learn appropriate scheduling policies, irrespective of request arrival patterns and system scales. Moreover, compared with the baselines, KaiS improves the average system throughput rate by 15.9% while reducing scheduling cost by 38.4%.
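The interplay between the two time scales (per-slot request dispatch versus periodic service orchestration) can be illustrated with a minimal sketch. This is not the KaiS implementation: the learned multi-agent actor-critic dispatcher is replaced by a hypothetical least-loaded rule, the GNN-based orchestrator is replaced by a greedy placement that ranks services by recently observed demand, and the names `EdgeNode`, `dispatch`, `orchestrate`, `run`, the per-slot `capacity`, and `orchestration_period` are all illustrative assumptions. Requests that find no eligible edge node are assumed to be forwarded to the cloud.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EdgeNode:
    name: str
    services: set = field(default_factory=set)  # service types currently deployed here
    capacity: int = 2                           # requests served per slot (hypothetical)
    load: int = 0                               # requests already accepted in this slot


def dispatch(request, nodes):
    """Fast time scale: route one request to an edge node hosting its service.

    Least-loaded stand-in for the learned dispatch policy (assumption).
    The candidate list plays the role of a dynamic dispatch space.
    """
    candidates = [n for n in nodes if request in n.services and n.load < n.capacity]
    if not candidates:
        return None  # empty dispatch space: assume the request goes to the cloud
    target = min(candidates, key=lambda n: n.load)
    target.load += 1
    return target.name


def orchestrate(demand, nodes):
    """Slow time scale: redeploy one service per node, most-demanded first.

    Greedy stand-in for the GNN + policy-network orchestration (assumption).
    """
    ranked = [svc for svc, _ in demand.most_common()]
    if not ranked:
        return  # no observed demand: keep the current placement
    for i, node in enumerate(nodes):
        node.services = {ranked[i % len(ranked)]}


def run(trace, nodes, orchestration_period=3):
    """Simulate a trace (one list of requests per slot); return (served_at_edge, sent_to_cloud)."""
    served, to_cloud = 0, 0
    demand = Counter()
    for t, slot_requests in enumerate(trace):
        # Orchestration happens only every `orchestration_period` slots.
        if t > 0 and t % orchestration_period == 0:
            orchestrate(demand, nodes)
            demand.clear()
        for node in nodes:
            node.load = 0  # per-slot capacity is renewed every dispatch slot
        for req in slot_requests:
            demand[req] += 1
            if dispatch(req, nodes) is None:
                to_cloud += 1
            else:
                served += 1
    return served, to_cloud
```

In this toy setting, two nodes that initially host only service `a` receive a trace dominated by service `b`; after the first orchestration step one node is switched to `b`, and edge-side throughput rises for the remaining slots. The separation mirrors the motivation for two time scales: dispatch decisions are cheap and frequent, whereas redeploying services is costly and therefore performed less often.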