Machine learning (ML) accelerators have been studied and used extensively to compute ML models with high performance and low power. However, designing such accelerators normally takes a long time and requires significant effort. Unfortunately, ML models evolve much faster than the accelerator design cycle; frequent and drastic changes to model architectures render many accelerators obsolete. Existing design tools and frameworks can provide quick accelerator prototyping, but only for a limited range of models that fit into a single hardware device, such as an FPGA. Furthermore, with the emergence of large language models such as GPT-3, there is an increased need for hardware prototyping of these large models within a many-accelerator system to ensure the hardware can scale with ever-growing model sizes. In this paper, we propose an efficient and scalable approach for exploring accelerator systems to compute large ML models. We developed a tool named MASE that can directly map large ML models onto an efficient streaming accelerator system. Over a set of ML models, we show that MASE can achieve better energy efficiency than GPUs when computing inference for recent transformer models. Our tool will be open-sourced upon publication.