Datacenters are the backbone of our digital society, but they raise numerous operational challenges. We envision digital twins becoming primary instruments in datacenter operations, continuously and autonomously supporting major operational decisions and adapting ICT infrastructure live, with a human in the loop. Although fields such as aviation and autonomous driving successfully employ digital twins, an open-source digital twin for datacenters has not yet been demonstrated to the community. Addressing this challenge, we design, implement, and experiment with OpenDT, an Open-source Digital Twin for monitoring and operating datacenters through a continuous integration cycle that includes: (1) live and continuous telemetry data; (2) discrete-event simulation using live telemetry from the physical ICT, with self-calibration; and (3) SLO-aware and human-approved feedback to the physical ICT. Through trace-driven experiments with a prototype mainly covering stages 1 and 2 of the cycle, we show that (i) OpenDT can be used to reproduce peer-reviewed experiments and extend the analysis with performance and energy-efficiency results; and (ii) OpenDT's online re-calibration can increase digital-twinning accuracy, achieving a mean absolute percentage error (MAPE) of 4.39% vs. 7.86% in peer-reviewed work. OpenDT adheres to FAIR/FOSS principles and is available at: https://github.com/atlarge-research/opendt/tree/hcp.
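For readers unfamiliar with the accuracy metric quoted above, the following is a minimal sketch of how a mean absolute percentage error (MAPE) between measured and simulated values is commonly computed. It is not taken from the OpenDT code base: the function name `mape` and the sample power readings are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Return the mean absolute percentage error, expressed in percent.

    `actual` holds measured values (e.g., power readings from telemetry);
    `predicted` holds the corresponding simulated values.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one sample is required")
    if any(a == 0 for a in actual):
        # MAPE is undefined when a measured value is zero.
        raise ValueError("actual values must be non-zero")

    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical readings in watts (illustrative only, not data from the paper).
measured = [100.0, 200.0, 400.0]
simulated = [110.0, 190.0, 400.0]
error_percent = mape(measured, simulated)  # roughly 5%
```

A lower value means the simulated series tracks the measured one more closely, which is the sense in which 4.39% improves on 7.86%.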