Cloud computing for high-performance computing (HPC) resources is an emerging topic. Such services are of interest to researchers who care about reproducible computing, to users of software packages with complex installations, and to companies or researchers who need compute resources only occasionally or do not want to run and maintain a supercomputer of their own. The connection between HPC and containers is exemplified by the fact that Microsoft Azure's Eagle cloud service machine is number three on the November 2023 TOP500 list. For cloud services, the HPC application and its dependencies are installed in containers, e.g. Docker or Singularity, and these containers are executed on the physical hardware. Although containerization leverages the existing Linux kernel and should not impose overhead on the computation, machine-specific optimizations might be lost, particularly machine-specific installations of commonly used packages. In this paper, we use an astrophysics application using HPX-Kokkos and measure overheads on homogeneous resources using CPUs only, e.g. Supercomputer Fugaku, and on heterogeneous resources, e.g. LSU's hybrid CPU and GPU system. We report on challenges in compiling, running, and using the containers, as well as on performance differences.