The emergence of large-scale AI models such as GPT-4 has significantly impacted academia and industry, driving demand for high-performance computing (HPC) to accelerate their workloads. To address this demand, we present HPCClusterScape, a visualization system that improves the efficiency and transparency of shared HPC clusters used for large-scale AI models. HPCClusterScape provides a comprehensive overview of system-level information (e.g., partitions, hosts, and workload status) and application-level information (e.g., the identification of experiments and researchers), allowing HPC operators and machine learning researchers to monitor resource utilization and to identify issues through customizable violation rules. The system also includes diagnostic tools for investigating workload imbalances and synchronization bottlenecks in large-scale distributed deep learning experiments. HPCClusterScape has been deployed in industrial-scale HPC clusters, where it incorporates user feedback and meets the specific requirements of that setting. This paper outlines the challenges and prerequisites for efficient HPC operation, introduces the interactive visualization system, and highlights how it addresses pain points and optimizes resource utilization in shared HPC clusters.