As chips become increasingly compute-intensive, the mismatch between the shapes of computation layers and the available computation resources significantly limits chip utilization. Driven by this observation, prior works explore spatial accelerators or dataflow architectures to maximize throughput. However, using spatial accelerators can increase execution latency. In this work, we first systematically investigate two execution models: (1) sequentially (temporally) launching one monolithic accelerator, and (2) spatially launching multiple accelerators. From these observations, we find that there is a latency-throughput tradeoff between the two execution models, and that combining the two strategies yields a more efficient latency-throughput Pareto front. To achieve this, we propose the spatial sequential architecture (SSR) and an SSR design automation framework that explores both strategies together when deploying deep learning inference. We implement SSR accelerators on the 7nm AMD Versal ACAP VCK190 board for four end-to-end transformer-based deep learning models. SSR achieves average throughput gains of 2.53x, 35.71x, and 14.20x under different batch sizes compared to the 8nm Nvidia A10G GPU and the 16nm AMD ZCU102 and U250 FPGAs, with average energy efficiency gains of 8.51x, 6.75x, and 21.22x, respectively. Compared with the sequential-only and spatial-only solutions on VCK190, our spatial-sequential-hybrid solutions achieve higher throughput under the same latency requirement and lower latency under the same throughput requirement. We also use the SSR analytical models to demonstrate how SSR can optimize solutions on other computing platforms, e.g., the 14nm Intel Stratix 10 NX.