Modern data centers (DCs) hosting artificial intelligence (AI)-dedicated devices operate at high power densities with rapidly varying workloads, making minute-level adaptation essential for safe and energy-efficient operation. However, the manual design of piecewise deep reinforcement learning (DRL) agents cannot keep pace with the frequent dynamics shifts and service-level agreement (SLA) changes of an evolving DC. This specification-to-policy lag leaves the DC without timely, effective control policies and can lead to service outages. To bridge the gap, we present DCoPilot, a hybrid framework for generative control policies in dynamic DC operation. DCoPilot synergizes two distinct generative paradigms: a large language model (LLM) that performs symbolic generation of structured reward forms, and a hypernetwork that performs parametric generation of policy weights. DCoPilot operates through three coordinated phases: (i) simulation scale-up, which stress-tests reward candidates across diverse simulation-ready (SimReady) scenes; (ii) meta policy distillation, in which a hypernetwork is trained to output policy weights conditioned on SLA and scene embeddings; and (iii) online adaptation, which enables zero-shot policy generation in response to updated specifications. Evaluated across five control task families spanning diverse DC components, DCoPilot achieves near-zero constraint violations and outperforms all baselines across specification variations. Ablation studies validate the effectiveness of LLM-based unified reward generation in enabling stable hypernetwork convergence.
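To make the parametric-generation paradigm concrete, the following is a minimal sketch, not the authors' implementation, of how a hypernetwork can map an SLA embedding and a scene embedding to the full weight vector of a small control policy, in the spirit of phases (ii) and (iii). All names, dimensions, and the two-layer policy structure (`HyperPolicyGenerator`, `sla_dim`, `scene_dim`, and so on) are illustrative assumptions.

```python
# Illustrative sketch of hypernetwork-based policy-weight generation (PyTorch).
# Assumed, not from the paper: all class/argument names and layer sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperPolicyGenerator(nn.Module):
    """Maps (SLA embedding, scene embedding) to the weights of a small policy MLP."""

    def __init__(self, sla_dim=8, scene_dim=16, obs_dim=12, act_dim=4, hidden=64):
        super().__init__()
        self.obs_dim, self.hidden, self.act_dim = obs_dim, hidden, act_dim
        # Total parameter count of the generated two-layer policy.
        self.n_params = (obs_dim * hidden + hidden) + (hidden * act_dim + act_dim)
        # Hypernetwork body: conditions on the concatenated SLA + scene embeddings.
        self.net = nn.Sequential(
            nn.Linear(sla_dim + scene_dim, 128),
            nn.ReLU(),
            nn.Linear(128, self.n_params),
        )

    def forward(self, sla_emb, scene_emb, obs):
        # Generate a flat weight vector, then slice it into the policy's layers.
        theta = self.net(torch.cat([sla_emb, scene_emb], dim=-1))
        o, h, a = self.obs_dim, self.hidden, self.act_dim
        i = 0
        w1 = theta[i:i + o * h].view(h, o); i += o * h
        b1 = theta[i:i + h];                i += h
        w2 = theta[i:i + h * a].view(a, h); i += h * a
        b2 = theta[i:i + a]
        # Run the freshly generated policy on the current observation.
        x = F.relu(F.linear(obs, w1, b1))
        return F.linear(x, w2, b2)

# Zero-shot use: a new SLA/scene embedding yields a new policy with no gradient updates.
gen = HyperPolicyGenerator()
action = gen(torch.randn(8), torch.randn(16), torch.randn(12))
```

Under this reading, only the hypernetwork is trained (during meta policy distillation, against the LLM-generated rewards); at deployment, updated specifications change only the embeddings, so a new policy is emitted without further optimization.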