One of the more complex tasks for researchers using HPC systems is monitoring and tuning the performance of their applications. Developing a practice of continuous performance improvement, both for speed-up and for efficient use of resources, is essential to the long-term success of both the HPC practitioner and the research project. Profiling tools provide a useful view of an application's performance but often have a steep learning curve and rarely provide an easy-to-interpret view of resource utilization. Lower-level tools such as top and htop provide a view of resource utilization for those familiar and comfortable with Linux but present a barrier for newer HPC practitioners. To expand the existing profiling and job monitoring options, the MIT Lincoln Laboratory Supercomputing Center created LLload, a tool that captures a snapshot of the resources being used by a job on a per-user basis. Built from standard HPC tools, LLload provides an easy way for a researcher to track the resource usage of active jobs. We explain how the tool was designed and implemented and provide insight into how it is used to help new researchers develop their performance monitoring skills and to guide researchers in their resource requests.