Determining the maximum usage of random-access memory (RAM), both on the motherboard and on a graphics processing unit (GPU), over the lifetime of a computing task can be extremely useful for troubleshooting points of failure and for optimizing memory utilization, especially in a high-performance computing (HPC) setting. While tools exist for tracking compute time and RAM, including job management tools themselves, to our knowledge no sufficient solution currently exists for tracking GPU usage. We present gpu_tracker, a Python package that runs in the background and tracks the computational resource usage of a task, including the real compute time the task takes to complete, its maximum RAM usage, and its maximum GPU RAM usage, specifically for Nvidia GPUs. We demonstrate that gpu_tracker can seamlessly track computational resource usage with minimal overhead in both desktop and HPC execution environments.
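To make the idea of background tracking concrete, the following is a minimal sketch of how peak resource usage might be sampled while a task runs. It is not the gpu_tracker implementation or its API: the names `PeakTracker` and `gpu_ram_mib` are hypothetical, the sketch assumes a Unix-like system (it uses the standard-library `resource` module), and it assumes that `nvidia-smi` is on the `PATH` when an Nvidia GPU is present. If `nvidia-smi` is unavailable, GPU RAM is simply reported as 0.

```python
import os
import resource  # Unix-only standard-library module
import shutil
import subprocess
import threading
import time


def gpu_ram_mib(pid: int) -> int:
    """GPU memory (MiB) that nvidia-smi attributes to `pid`; 0 if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return 0
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-compute-apps=pid,used_memory",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return 0
    total = 0
    for line in out.splitlines():
        fields = [f.strip() for f in line.split(",")]
        # Skip malformed rows and rows where memory is reported as "[N/A]".
        if len(fields) == 2 and fields[0] == str(pid) and fields[1].isdigit():
            total += int(fields[1])
    return total


class PeakTracker:
    """Hypothetical context manager: polls GPU RAM in a background thread."""

    def __init__(self, interval: float = 0.5):
        self.interval = interval      # seconds between GPU samples
        self.max_gpu_ram_mib = 0      # peak GPU RAM seen for this process
        self.max_rss = 0              # peak resident set size (see __exit__)
        self.elapsed_s = 0.0          # wall-clock duration of the block
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)

    def _sample(self) -> None:
        self.max_gpu_ram_mib = max(self.max_gpu_ram_mib,
                                   gpu_ram_mib(os.getpid()))

    def _poll(self) -> None:
        # Event.wait returns False on timeout, True once stop is requested.
        while not self._stop.wait(self.interval):
            self._sample()

    def __enter__(self):
        self._start = time.perf_counter()
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        self._sample()  # one final sample before reporting
        self.elapsed_s = time.perf_counter() - self._start
        # The OS records peak RSS for the whole process lifetime, so no
        # polling is needed. Units: kilobytes on Linux, bytes on macOS.
        self.max_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        return False


if __name__ == "__main__":
    with PeakTracker(interval=0.1) as tracker:
        data = bytearray(50_000_000)  # stand-in for the real workload
        time.sleep(0.3)
    print(tracker.elapsed_s, tracker.max_rss, tracker.max_gpu_ram_mib)
```

Because GPU memory is sampled at intervals, a short-lived spike between two samples can be missed; a shorter interval narrows that gap at the cost of more `nvidia-smi` calls. The sketch also follows only the current process, whereas a complete tracker would typically need to account for child processes as well.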