The ability to monitor and interpret hardware system events and behaviors is crucial to improving the robustness and reliability of these systems, especially in a supercomputing facility. The growing complexity and scale of these systems demand an increase in monitoring data collected at multiple fidelity levels and varying temporal resolutions. In this work, we aim to build a holistic analytical system that helps make sense of such massive data, mainly the hardware logs, job logs, and environment logs collected from disparate subsystems and components of a supercomputer system. This end-to-end log analysis system, coupled with visual analytics support, allows users to promptly extract supercomputer usage and error patterns at varying temporal and spatial resolutions. We use multiresolution dynamic mode decomposition (mrDMD), a technique that represents high-dimensional data as correlated spatiotemporal variation patterns, or modes, to extract variation patterns isolated at specified frequencies. Our improvements to the mrDMD algorithm help promptly reveal useful information in the massive environment log dataset, which is then associated with the processed hardware and job log datasets using our visual analytics system. Furthermore, our system can identify usage and error patterns filtered at the user, project, and subcomponent levels. We demonstrate the effectiveness of our approach with two usage scenarios on the Cray XC40 supercomputer.
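To make the idea of frequency-isolated modes concrete, the following is a minimal sketch of plain (single-level) exact DMD in NumPy. It is not the improved mrDMD algorithm described in this work, which is not specified here; it only illustrates the building block that mrDMD applies recursively over time windows. The function name `exact_dmd`, the synthetic sensor data, and the 1 Hz cutoff are all assumptions chosen for illustration.

```python
import numpy as np


def exact_dmd(X, rank, dt):
    """Exact DMD of a snapshot matrix X (rows: sensors, columns: time steps).

    Returns the DMD modes and the oscillation frequency of each mode in Hz.
    """
    X1, X2 = X[:, :-1], X[:, 1:]                  # time-shifted snapshot pairs
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]   # rank truncation
    B = (X2 @ Vh.conj().T) / s                    # X2 V S^-1 (column-wise scaling)
    A_tilde = U.conj().T @ B                      # reduced linear operator
    eigvals, W = np.linalg.eig(A_tilde)
    modes = B @ W                                 # spatial patterns, one per column
    freqs = np.abs(np.angle(eigvals)) / (2 * np.pi * dt)
    return modes, freqs


# Hypothetical stand-in for an environment log: 8 sensors observing a mix of
# a slow (0.5 Hz) and a fast (4 Hz) oscillation, sampled every 10 ms.
rng = np.random.default_rng(0)
dt = 0.01
t = np.arange(0, 10, dt)
signals = np.vstack([
    np.cos(2 * np.pi * 0.5 * t), np.sin(2 * np.pi * 0.5 * t),
    np.cos(2 * np.pi * 4.0 * t), np.sin(2 * np.pi * 4.0 * t),
])
mixing = rng.standard_normal((8, 4))              # random spatial patterns
X = mixing @ signals

modes, freqs = exact_dmd(X, rank=4, dt=dt)

# Separate slow modes from fast ones with an assumed 1 Hz cutoff; mrDMD
# performs a split of this kind at every level of its time-window hierarchy.
cutoff_hz = 1.0
slow = freqs < cutoff_hz
print(np.round(np.sort(freqs), 3))
```

On this noise-free synthetic input, the recovered frequencies should lie close to 0.5 Hz and 4 Hz, each appearing as a complex-conjugate pair. Real log data would be noisy and non-stationary, which is part of what motivates the multiresolution variant.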