Real-time supercomputing performance analysis is a critical aspect of evaluating and optimizing computational systems in a dynamic user environment. The operation of supercomputers produces vast quantities of analytic data from multiple sources and of varying types, so compiling this data in an efficient manner is critical to the process. The MIT Lincoln Laboratory Supercomputing Center has been utilizing the Unity 3D game engine to create a Digital Twin of our supercomputing systems for several years to perform system monitoring. Unity offers robust visualization capabilities, making it ideal for creating a sophisticated representation of the computational processes. As we scale the systems to include a diversity of resources such as accelerators and to support more users, we need to implement new analysis tools for the monitoring system. The workloads in research continuously change, as do the capabilities of Unity, and this allows us to adapt our monitoring tools to scale and to incorporate features enabling efficient replay of system-wide events, user isolation, and machine-level granularity. Our system takes full advantage of the modern capabilities of the Unity Engine in a way that intuitively represents the real-time workload performed on a supercomputer. It allows HPC system engineers to quickly diagnose usage-related errors with its responsive user interface, which scales efficiently with large data sets.