By 2025, zettabytes of data are expected to be generated every year. The size and complexity of modern large-scale computing infrastructures such as High-Performance Computing (HPC) systems continue to grow, raising concerns about their manageability and sustainability. For this reason, such systems are equipped with fine-grained monitoring and Operational Data Analytics (ODA) capabilities to optimise their efficiency. In this literature study, we list the fundamental pillars of large-scale computing infrastructures that enable their ODA capabilities, and survey the popular ODA frameworks operating in various such environments (predominantly HPC). Based on this, we propose a more holistic ODA framework that matches the layers of the large-scale graph-processing distributed ecosystem proposed by Sherif Sak et al. and extends the ODA functionalities of an existing ODA framework proposed by Netti et al. We compare our holistic ODA framework with some of the state-of-the-art frameworks covered in this study to highlight its novelty, which we hope will draw more attention to research in this field. To raise awareness, we highlight the significant operational efficiencies observed after state-of-the-art ODA frameworks were implemented, illustrating their practical benefits for readers, and lastly, we discuss ongoing research trends in this field.