Modern data centers (DCs) hosting artificial intelligence (AI)-dedicated devices operate at high power densities with rapidly varying workloads, making minute-level adaptation essential for safe and energy-efficient operation. However, the manual design of piecewise deep reinforcement learning (DRL) agents cannot keep pace with the frequent dynamics shifts and service-level agreement (SLA) changes of an evolving DC. This specification-to-policy lag leaves the DC without timely, effective control policies and can lead to service outages. To bridge the gap, we present DCoPilot, a hybrid framework for generative control policies in dynamic DC operation. DCoPilot synergizes two distinct generative paradigms: a large language model (LLM) that performs symbolic generation of structured reward forms, and a hypernetwork that performs parametric generation of policy weights. DCoPilot operates through three coordinated phases: (i) simulation scale-up, which stress-tests reward candidates across diverse simulation-ready (SimReady) scenes; (ii) meta policy distillation, in which a hypernetwork is trained to output policy weights conditioned on SLA and scene embeddings; and (iii) online adaptation, which enables zero-shot policy generation in response to updated specifications. Evaluated across five control task families spanning diverse DC components, DCoPilot achieves near-zero constraint violations and outperforms all baselines across specification variations. Ablation studies validate the effectiveness of LLM-based unified reward generation in enabling stable hypernetwork convergence.
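To make the parametric-generation paradigm concrete, the following is a minimal sketch, not the authors' implementation, of how a hypernetwork can map an SLA embedding and a scene embedding to the full weight vector of a small control policy, in the spirit of phases (ii) and (iii). All names, dimensions, and the two-layer policy structure (`HyperPolicyGenerator`, `sla_dim`, `scene_dim`, and so on) are illustrative assumptions.

```python
# Illustrative sketch of hypernetwork-based policy-weight generation (PyTorch).
# Assumed, not from the paper: all class/argument names and layer sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperPolicyGenerator(nn.Module):
    """Maps (SLA embedding, scene embedding) to the weights of a small policy MLP."""

    def __init__(self, sla_dim=8, scene_dim=16, obs_dim=12, act_dim=4, hidden=64):
        super().__init__()
        self.obs_dim, self.hidden, self.act_dim = obs_dim, hidden, act_dim
        # Total parameter count of the generated two-layer policy.
        self.n_params = (obs_dim * hidden + hidden) + (hidden * act_dim + act_dim)
        # Hypernetwork body: conditions on the concatenated SLA + scene embeddings.
        self.net = nn.Sequential(
            nn.Linear(sla_dim + scene_dim, 128),
            nn.ReLU(),
            nn.Linear(128, self.n_params),
        )

    def forward(self, sla_emb, scene_emb, obs):
        # Generate a flat weight vector, then slice it into the policy's layers.
        theta = self.net(torch.cat([sla_emb, scene_emb], dim=-1))
        o, h, a = self.obs_dim, self.hidden, self.act_dim
        i = 0
        w1 = theta[i:i + o * h].view(h, o); i += o * h
        b1 = theta[i:i + h];                i += h
        w2 = theta[i:i + h * a].view(a, h); i += h * a
        b2 = theta[i:i + a]
        # Run the freshly generated policy on the current observation.
        x = F.relu(F.linear(obs, w1, b1))
        return F.linear(x, w2, b2)

# Zero-shot use: a new SLA/scene embedding yields a new policy with no gradient updates.
gen = HyperPolicyGenerator()
action = gen(torch.randn(8), torch.randn(16), torch.randn(12))
```

Under this reading, only the hypernetwork is trained (during meta policy distillation, against the LLM-generated rewards); at deployment, updated specifications change only the embeddings, so a new policy is emitted without further optimization.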