High performance is needed in many computing systems, from batch-managed supercomputers to general-purpose cloud platforms. However, scientific clusters lack elastic parallelism, while clouds cannot offer competitive costs for high-performance applications. In this work, we investigate how modern cloud programming paradigms can bring the elasticity needed to allocate idle resources, decreasing computation costs and improving overall data center efficiency. Function-as-a-Service (FaaS) brings pay-as-you-go execution of stateless functions, but its performance characteristics cannot match coarse-grained cloud and cluster allocations. To make serverless computing viable for high-performance and latency-sensitive applications, we present rFaaS, an RDMA-accelerated FaaS platform. We identify two critical limitations of serverless computing, centralized scheduling and inefficient network transport, and improve the FaaS architecture with allocation leases and microsecond invocations. We show that our remote functions add only negligible overhead on top of the fastest available networks, and we decrease the execution latency by orders of magnitude compared to contemporary FaaS systems. Furthermore, we demonstrate the performance of rFaaS by evaluating real-world FaaS benchmarks and parallel applications. Overall, our results show that new allocation policies and remote memory access help FaaS applications achieve high performance and bring serverless computing to HPC.