FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs or network-attached accelerators. Despite their potential, developing distributed FPGA-accelerated applications remains cumbersome due to the lack of appropriate infrastructure and communication abstractions. To facilitate the development of distributed applications with FPGAs, in this paper we propose ACCL+, an open-source versatile FPGA-based collective communication library. Portable across different platforms and supporting UDP, TCP, as well as RDMA, ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective communication. Additionally, it can serve as a collective offload engine for CPU applications, freeing the CPU from networking tasks. It is user-extensible, allowing new collectives to be implemented and deployed without having to re-synthesize the FPGA circuit. We evaluated ACCL+ on an FPGA cluster with 100 Gb/s networking, comparing its performance against software MPI over RDMA. The results demonstrate ACCL+'s significant advantages for FPGA-based distributed applications and highly competitive performance for CPU applications. We showcase ACCL+'s dual role with two use cases: seamlessly integrating as a collective offload engine to distribute CPU-based vector-matrix multiplication, and serving as a crucial and efficient component in designing fully FPGA-based distributed deep-learning recommendation inference.