FPGAs are rarely mentioned when discussing the implementation of large machine learning applications, such as Large Language Models (LLMs), in the data center. There is considerable evidence that a single FPGA can be competitive with a GPU for some computations, especially those requiring low latency, and is often much more efficient when power is considered. This suggests that there is merit to exploring the use of multiple FPGAs for large machine learning applications. The challenge with using multiple FPGAs is that there is no commonly accepted flow for developing and deploying multi-FPGA applications, i.e., there are no tools to describe a large application, map it to multiple FPGAs, and then deploy the application on a multi-FPGA platform. In this paper, we explore the feasibility of implementing large transformers using multiple FPGAs by developing a scalable multi-FPGA platform and tools to map large applications onto the platform. We validate our approach by designing an efficient multi-FPGA version of the I-BERT transformer and implementing one encoder using six FPGAs as a working proof of concept to show that our platform and tools work. Based on our proof-of-concept prototype and performance estimates for the latest FPGAs compared to GPUs, we conclude that there can be a place for FPGAs in the world of large machine learning applications. We demonstrate a promising first step showing that, with the right infrastructure and tools, it is reasonable to continue exploring the possible benefits of using FPGAs for applications such as LLMs.