AI data centers experience rapid fluctuations in power demand because of the heterogeneity of the computational tasks they must support. For example, the power profiles of inference and training of large language models (LLMs) are quite distinct, and large divergences can destabilize the underlying electricity grid. In this paper we propose, to the best of our knowledge, the first physics-informed DLinear time-series model that can accurately forecast the power utilization of an AI data center 5-80 minutes into the future (short-term forecasting). The physics, based on a multi-node lumped thermal resistance-capacitance (RC) network consistent with Newton's law of cooling, is captured using newly derived time-dependent ordinary differential equations (ODEs) that separately model power consumption and interlink it with GPU compute utilization, memory utilization, and temperature. The resulting model, which we refer to as PI-DLinear, is trained and evaluated on a real AI data center dataset; it is not only more accurate than the state-of-the-art (SOTA) models tested, but its forecast profile also respects the underlying physics under power-throttling and load-transient events. Relative to the SOTA transformer-based and non-transformer-based models, improvements in forecasting accuracy (averaged across all look-back and prediction windows) range from 0.782% to 39.08% for MSE, 0.993% to 51.82% for MAE, and 0.370% to 22.28% for RMSE.
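To make the idea of a physics constraint concrete, the sketch below uses a single-node lumped RC thermal model, C dT/dt = P(t) - (T - T_amb)/R, which is the simplest form consistent with Newton's law of cooling. It is an illustrative simplification, not the multi-node ODEs derived in this paper. The function names (`simulate`, `rc_residual`, `physics_penalty`) and the parameter values are hypothetical and chosen only for the example.

```python
# Single-node lumped RC thermal model (an assumed simplification):
#   C * dT/dt = P(t) - (T - T_amb) / R
# All names and parameter values here are hypothetical, for illustration only.

def simulate(powers, t0, dt, r_th, c_th, t_amb):
    """Forward-Euler temperature trajectory driven by a power trace."""
    temps = [t0]
    for p in powers[:-1]:
        t = temps[-1]
        temps.append(t + dt * (p - (t - t_amb) / r_th) / c_th)
    return temps

def rc_residual(temps, powers, dt, r_th, c_th, t_amb):
    """ODE residual at each step, using a forward difference for dT/dt."""
    residuals = []
    for k in range(len(temps) - 1):
        dT_dt = (temps[k + 1] - temps[k]) / dt
        heat_in = powers[k]
        heat_out = (temps[k] - t_amb) / r_th
        residuals.append(c_th * dT_dt - (heat_in - heat_out))
    return residuals

def physics_penalty(temps, powers, dt, r_th, c_th, t_amb):
    """Mean squared ODE residual; near zero when the trajectory obeys the model."""
    res = rc_residual(temps, powers, dt, r_th, c_th, t_amb)
    return sum(r * r for r in res) / len(res)

# Hypothetical parameters: 0.1 K/W, 500 J/K, 25 C ambient, 1 s steps.
R_TH, C_TH, T_AMB, DT = 0.1, 500.0, 25.0, 1.0
powers = [300.0] * 601  # constant 300 W load
temps = simulate(powers, T_AMB, DT, R_TH, C_TH, T_AMB)

consistent = physics_penalty(temps, powers, DT, R_TH, C_TH, T_AMB)

# Perturb one sample by 5 K to mimic a physically implausible forecast.
bad_temps = list(temps)
bad_temps[100] += 5.0
violating = physics_penalty(bad_temps, powers, DT, R_TH, C_TH, T_AMB)
```

In a physics-informed forecaster, a penalty of this kind would typically be added to the data-fitting loss with a weighting factor, so that forecasts that violate the thermal dynamics are discouraged. How PI-DLinear combines its physics terms with the forecasting loss is specified by the paper's own formulation, not by this sketch.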