The growing use and cost of high performance computing (HPC) call for new, easy-to-use tools that let HPC users and HPC systems engineers see transparently how resources are being utilized. The MIT Lincoln Laboratory Supercomputing Center (LLSC) has developed a simple command, LLload, to monitor and characterize HPC workloads. LLload plays an important role in identifying opportunities for better utilization of compute resources. It can be used to monitor jobs both programmatically and interactively, and its options let users characterize their jobs in order to run them more efficiently. This information can guide users in optimizing HPC workloads and improving both CPU and GPU utilization, including through judicious oversubscription of computing resources. Preliminary results suggest that, in some cases, GPU overloading significantly improves GPU utilization and overall throughput. By enabling users to observe and fix incorrect job submissions and inappropriate execution setups, LLload can increase resource usage and improve overall throughput. LLload is a lightweight, easy-to-use tool with which both HPC users and HPC systems engineers can monitor HPC workloads to improve system utilization and efficiency.