Aggregated HPC resources have rigid allocation systems and programming models that struggle to adapt to diverse and changing workloads. Consequently, HPC systems fail to use large pools of unused memory efficiently and leave idle computing resources underutilized. Prior work attempted to increase the throughput and efficiency of supercomputing systems through workload co-location and resource disaggregation. However, these methods fall short of providing a solution that can be applied to existing systems without major hardware modifications and performance losses. In this paper, we improve the utilization of supercomputers by employing the new cloud paradigm of serverless computing. We show how serverless functions provide fine-grained access to the resources of batch-managed cluster nodes. We present an HPC-oriented Function-as-a-Service (FaaS) platform that satisfies the requirements of high-performance applications. We demonstrate a software resource disaggregation approach in which placing functions on unallocated and underutilized nodes puts idle cores and accelerators to use while retaining near-native performance.