AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling

Modern cloud-native applications require intelligent schedulers that balance system stability, resource utilisation, and cost. Kubernetes provides feasibility-based placement by default, and recent research has explored reinforcement learning (RL) for more intelligent scheduling decisions. However, current RL-based schedulers have three major limitations. First, most use monolithic centralised agents, which scale poorly to large heterogeneous clusters. Second, those that use multi-objective reward functions assume simple, static, linear combinations of the objectives. Third, no previous work has produced a stress-aware scheduler that reacts adaptively to dynamic conditions. To address these gaps, we propose the Adaptive Graph-enhanced Multi-Agent Reinforcement Learning Dynamic Kubernetes Scheduler (AGMARL-DKS), which introduces three major innovations. First, for scalability, we formulate scheduling as a cooperative multi-agent problem in which every cluster node acts as an agent, using centralised training with decentralised execution. Second, to keep agents context-aware yet decentralised, we use a Graph Neural Network (GNN) to build, at each agent, a state representation of the global cluster context, an improvement over methods that rely solely on local observations. Finally, to make trade-offs among the objectives, we use a stress-aware lexicographic ordering policy instead of a simple, static linear weighting. Evaluations on Google Kubernetes Engine (GKE) show that AGMARL-DKS significantly outperforms the default scheduler in fault tolerance, utilisation, and cost, especially when scheduling batch and mission-critical workloads.
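
To make the contrast between static linear weighting and stress-aware lexicographic ordering concrete, the following is a minimal illustrative sketch, not the AGMARL-DKS implementation. Everything in it is an assumption introduced for illustration: the objective names, the scores normalised to [0, 1] with higher being better, the stress threshold, the tie tolerance, and the two priority orders.

```python
from dataclasses import dataclass

# Hypothetical objectives; scores are assumed normalised to [0, 1], higher is better.
OBJECTIVES = ("stability", "utilisation", "cost")


@dataclass(frozen=True)
class Candidate:
    node: str
    stability: float
    utilisation: float
    cost: float  # a cost *score*: higher means cheaper


def priority_order(stress: float, threshold: float = 0.7) -> tuple:
    """Assumed rule: under high stress rank stability first, otherwise cost first."""
    if stress >= threshold:
        return ("stability", "utilisation", "cost")
    return ("cost", "utilisation", "stability")


def lexicographic_choice(candidates, stress: float, tolerance: float = 0.05):
    """Pick the best candidate by comparing objectives in priority order."""
    order = priority_order(stress)

    def key(c):
        # Quantise each score so near-equal values tie and the next
        # objective in the priority order breaks the tie.
        return tuple(round(getattr(c, name) / tolerance) for name in order)

    return max(candidates, key=key)


def linear_choice(candidates, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Baseline: a static weighted sum that ignores cluster stress."""
    return max(
        candidates,
        key=lambda c: sum(w * getattr(c, n) for w, n in zip(weights, OBJECTIVES)),
    )


CANDIDATES = [
    Candidate("node-a", stability=0.95, utilisation=0.40, cost=0.30),
    Candidate("node-b", stability=0.60, utilisation=0.80, cost=0.90),
    Candidate("node-c", stability=0.93, utilisation=0.70, cost=0.20),
]

if __name__ == "__main__":
    print(lexicographic_choice(CANDIDATES, stress=0.9).node)
    print(lexicographic_choice(CANDIDATES, stress=0.2).node)
    print(linear_choice(CANDIDATES).node)
```

With these made-up scores, the static weighted sum selects node-b regardless of stress. The lexicographic rule selects node-c under high stress, because node-a and node-c tie on stability within the tolerance and utilisation breaks the tie, and it selects node-b under low stress, where cost is ranked first.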
