With the rise of cloud computing and lightweight containers, Docker has emerged as a leading technology for rapid service deployment, with Kubernetes responsible for pod orchestration. However, for compute-intensive workloads, particularly web services executing containerized machine-learning training, the default Kubernetes scheduler does not always achieve optimal placement. To address this, we propose two custom, reinforcement-learning-based schedulers, SDQN and SDQN-n, both built on the Deep Q-Network (DQN) framework. In compute-intensive scenarios, these models outperform the default Kubernetes scheduler as well as Transformer- and LSTM-based alternatives, reducing average CPU utilization per cluster node by 10%, and by over 20% when using SDQN-n. Moreover, our results show that SDQN-n's approach of consolidating pods onto fewer nodes further amplifies resource savings and helps advance greener, more energy-efficient data centers. Pod scheduling must therefore employ strategies tailored to each scenario in order to achieve better performance. Because the reinforcement-learning components of the SDQN and SDQN-n architectures proposed in this paper can be easily tuned by adjusting their parameters, they can accommodate the requirements of various future scenarios.