Two-tiered Online Optimization of Region-wide Datacenter Resource Allocation via Deep Reinforcement Learning

Chang-Lin Chen, Hanhan Zhou, Jiayu Chen, Mohammad Pedramfar, Vaneet Aggarwal, Tian Lan, Zheqing Zhu, Chi Zhou, Tim Gasser, Pol Mauri Ruiz, Vijay Menon, Neeraj Kumar, Hongbo Dong

This paper addresses the need for advanced techniques to continuously allocate workloads on shared infrastructure in data centers, a problem driven by the growing popularity and scale of cloud computing. In particular, little research addresses how capacity reservations can keep their capacity guarantees during large-scale failures. To tackle these issues, the paper presents scalable solutions for resource management. It builds on the prior establishment of capacity reservation in cluster management systems and on the two-level resource allocation problem addressed by the Resource Allowance System (RAS). Recognizing the limitations of Mixed Integer Linear Programming (MILP) for server assignment in a dynamic environment, the paper proposes the use of Deep Reinforcement Learning (DRL), which has been successful in achieving long-term optimal results for time-varying systems. Because directly applying DRL algorithms to large-scale instances with millions of decision variables is impractical, the paper introduces a novel two-level design that uses a DRL-based algorithm to solve the optimal server-to-reservation assignment, taking into account fault tolerance, server movement minimization, and network affinity requirements. The paper explores how the two levels are interconnected and the benefits of this approach for achieving long-term optimal results in large-scale cloud systems. We further show in the experiment section that our two-level DRL approach outperforms both the MIP solver and heuristic approaches, and requires significantly less computation time than the MIP solver. Specifically, our two-level DRL approach performs 15% better than the MIP solver in minimizing the overall cost. It also takes only 26 seconds to execute 30 rounds of decision making, while the MIP solver needs nearly an hour.
