SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration

As the computation intensity of chips increases, the mismatch between the shapes of computation layers and the available computation resources significantly limits chip utilization. Driven by this observation, prior works propose spatial accelerators or dataflow architectures to maximize throughput. However, spatial accelerators can increase execution latency. In this work, we first systematically investigate two execution models: (1) sequentially (temporally) launching one monolithic accelerator, and (2) spatially launching multiple accelerators. We observe a latency-throughput tradeoff between these two execution models, and find that combining the two strategies yields a more efficient latency-throughput Pareto front. To achieve this, we propose the spatial-sequential architecture (SSR) and the SSR design automation framework, which explore both strategies together when deploying deep learning inference. We use the 7nm AMD Versal ACAP VCK190 board to implement SSR accelerators for four end-to-end transformer-based deep learning models. SSR achieves average throughput gains of 2.53x, 35.71x, and 14.20x under different batch sizes compared to the 8nm Nvidia GPU A10G, the 16nm AMD FPGA ZCU102, and the U250, respectively. The average energy efficiency gains are 8.51x, 6.75x, and 21.22x, respectively. Compared with the sequential-only and spatial-only solutions on VCK190, our spatial-sequential hybrid solutions achieve higher throughput under the same latency requirement and lower latency under the same throughput requirement. We also use the SSR analytical models to demonstrate how SSR can optimize solutions on other computing platforms, e.g., the 14nm Intel Stratix 10 NX.
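The latency-throughput tradeoff between the two execution models can be illustrated with a toy calculation. The sketch below is not the SSR analytical model from the paper; it is a minimal illustration under invented assumptions: consecutive layers are grouped onto accelerators, each accelerator receives resources in proportion to its work, and an accelerator shared by more layers loses efficiency to shape mismatch. The names `evaluate` and `pareto_front`, the parameters `peak_eff` and `penalty`, and all numbers are hypothetical.

```python
from itertools import combinations


def evaluate(work, cuts, total_res, peak_eff=0.9, penalty=0.25):
    """Toy cost model (assumption, not from the paper).

    `work` lists per-layer operation counts. `cuts` are the indices where
    the layer sequence is split; each resulting group runs on its own
    accelerator, and groups are pipelined across inputs.
    """
    bounds = [0, *cuts, len(work)]
    groups = [work[a:b] for a, b in zip(bounds, bounds[1:])]
    total = sum(work)
    stage_times = []
    for g in groups:
        res = total_res * sum(g) / total  # resources proportional to work
        # Hypothetical mismatch penalty: sharing one accelerator among
        # more layers lowers its achieved efficiency.
        eff = peak_eff / (1 + penalty * (len(g) - 1))
        stage_times.append(sum(g) / (res * eff))
    latency = sum(stage_times)            # one input traverses every stage
    throughput = 1.0 / max(stage_times)   # steady state: slowest stage
    return latency, throughput


def pareto_front(points):
    """Keep points that no other point beats on both latency and throughput."""
    front = []
    for p in points:
        dominated = any(
            q["latency"] <= p["latency"]
            and q["throughput"] >= p["throughput"]
            and (q["latency"] < p["latency"] or q["throughput"] > p["throughput"])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda p: p["latency"])


work = [4, 2, 2, 4]  # made-up per-layer workloads
points = []
for k in range(len(work)):
    for cuts in combinations(range(1, len(work)), k):
        lat, thr = evaluate(work, cuts, total_res=12)
        points.append({"cuts": cuts, "latency": lat, "throughput": thr})

front = pareto_front(points)
for p in front:
    print(p["cuts"], round(p["latency"], 3), round(p["throughput"], 3))
```

In this toy setting, the empty cut tuple corresponds to the sequential-only design (one monolithic accelerator) and the full cut tuple to the spatial-only design (one accelerator per layer); partial cuts are hybrid designs that fall between the two on the front.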

