As the computation intensity of chips increases, the mismatch between the shapes of computation layers and the available computation resources significantly limits chip utilization. Driven by this observation, prior works explore spatial accelerators or dataflow architectures to maximize throughput. However, spatial accelerators can increase execution latency. In this work, we first systematically investigate two execution models: (1) sequentially (temporally) launching one monolithic accelerator, and (2) spatially launching multiple accelerators. We observe a latency-throughput tradeoff between these two execution models, and find that combining the two strategies yields a more efficient latency-throughput Pareto front. To achieve this, we propose the spatial sequential architecture (SSR) and the SSR design automation framework, which explore both strategies together when deploying deep learning inference. We use the 7nm AMD Versal ACAP VCK190 board to implement SSR accelerators for four end-to-end transformer-based deep learning models. SSR achieves average throughput gains of 2.53x, 35.71x, and 14.20x under different batch sizes compared to the 8nm Nvidia A10G GPU and the 16nm AMD ZCU102 and U250 FPGAs, respectively. The average energy efficiency gains are 8.51x, 6.75x, and 21.22x, respectively. Compared with the sequential-only and spatial-only solutions on VCK190, our spatial-sequential hybrid solutions achieve higher throughput under the same latency requirement and lower latency under the same throughput requirement. We also use the SSR analytical models to demonstrate how SSR can optimize solutions on other computing platforms, e.g., the 14nm Intel Stratix 10 NX.
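The latency-throughput tradeoff between the two execution models can be illustrated with a toy cost model. The sketch below is not the SSR analytical model; it is a minimal illustration under stated assumptions. The function `evaluate`, the per-layer work `work`, and the per-layer parallelism cap `par` (the most compute units a layer can keep busy, standing in for the shape mismatch) are all hypothetical. Each accelerator runs its assigned layers one after another, and the accelerators form a pipeline.

```python
from typing import List, Sequence, Tuple


def evaluate(
    work: Sequence[float],
    par: Sequence[int],
    groups: List[List[int]],
    alloc: Sequence[int],
) -> Tuple[float, float]:
    """Toy latency/throughput estimate for one mapping (illustrative only).

    work[i]   : amount of work in layer i (arbitrary units)
    par[i]    : max compute units layer i can use (assumed mismatch model)
    groups[k] : layer indices executed sequentially on accelerator k
    alloc[k]  : compute units given to accelerator k
    """
    assert len(groups) == len(alloc)
    stage_times = []
    for layers, units in zip(groups, alloc):
        # A layer cannot use more units than its parallelism cap.
        t = sum(work[i] / min(units, par[i]) for i in layers)
        stage_times.append(t)
    latency = sum(stage_times)           # one input traverses every stage
    throughput = 1.0 / max(stage_times)  # the slowest stage bounds the pipeline
    return latency, throughput


# Hypothetical 3-layer model on a chip with 6 compute units.
work = [12.0, 12.0, 12.0]
par = [6, 2, 6]

# (1) Sequential: one monolithic accelerator runs every layer in turn.
seq = evaluate(work, par, groups=[[0, 1, 2]], alloc=[6])
# (2) Spatial: one smaller accelerator per layer, pipelined.
spa = evaluate(work, par, groups=[[0], [1], [2]], alloc=[2, 2, 2])
# Hybrid: layers 0 and 2 share one accelerator, layer 1 gets its own.
hyb = evaluate(work, par, groups=[[0, 2], [1]], alloc=[4, 2])

for name, (lat, thr) in [("sequential", seq), ("spatial", spa), ("hybrid", hyb)]:
    print(f"{name:10s} latency={lat:.2f} throughput={thr:.3f}")
```

In this toy setup the sequential mapping has the lowest latency (10) but the lowest throughput (0.1), the spatial mapping raises throughput to about 0.167 at a latency of 18, and the hybrid mapping reaches the same throughput at a latency of 12. The numbers are artifacts of the assumed inputs and only illustrate why mixing the two strategies can widen the Pareto front.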