Supercomputers are complex, dynamic systems that comprise thousands of compute nodes and serve thousands of users. Accurately capturing their status requires vast amounts of system and performance data, so monitoring, maintaining, and optimizing them calls for sophisticated methods. Data visualization is a powerful technique for presenting these large data streams in an easily interpretable way. The MIT Lincoln Laboratory Supercomputing Center (LLSC) enables effective monitoring by combining 3D gaming technology with compound data streams in the TX-Digital Twin, a 3D simulation of the supercomputer. The TX-Digital Twin offers both live and historical data, in visual and text formats, and tracks a multitude of revealing performance metrics. Growing interest in GPU-accelerated computing has driven a need to monitor and maintain the GPU-accelerated resources in supercomputers. In this paper, we build on our previous solution by integrating the visualization of additional GPU metrics, such as GPU memory usage, temperature, and power draw, into the TX-Digital Twin. Using draw call optimization techniques, we add clear and effective displays of the new metrics with minimal impact on performance.