Mitigating the Bandwidth Wall via Data-Streaming System-Accelerator Co-Design

Transformers have revolutionized AI in natural language processing and computer vision, but their heavy compute and memory demands pose major challenges for hardware acceleration. In practice, end-to-end throughput is often limited by paged data movement and interconnect bandwidth rather than by raw MAC count. This work proposes a unified system-accelerator co-design approach for transformer inference that jointly optimizes a matrix accelerator and its system integration through paged streaming dataflows and explicit overlap of compute and transfer. On the hardware side, we introduce MatrixFlow, a loosely coupled 16x16 systolic-array accelerator with a page-aligned block matrix multiplication method using 4 KB tiles, a small on-chip buffer of about 20 KB, and a pipelined schedule of DMA-in, compute, and DMA-out that uses interconnect bandwidth efficiently. On the system side, we develop Gem5-AcceSys, an extension of the gem5 full-system simulator that supports exploration of standard interconnects such as PCIe and of configurable memory hierarchies, including Direct Memory, Direct Cache, and Device Memory modes with SMMU/TLB effects. We evaluate the co-design using gem5 simulations on representative transformer models, including BERT and ViT, across multiple data types and system configurations. Results show up to 22x end-to-end speedup over a CPU-only baseline and 5x to 8x gains over state-of-the-art loosely and tightly coupled accelerators. We further show that a standard PCIe-based host-memory design can achieve about 80 percent of the performance of on-device HBM. Overall, paged streaming and pipeline overlap, rather than large local SRAMs, are the most effective levers for efficient transformer inference under realistic system constraints.
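
To make the overlap argument concrete, the sketch below models a three-stage tile pipeline (DMA-in, compute, DMA-out) and compares it with a fully serialized schedule. It is a simplified illustration, not the MatrixFlow implementation: the function names, the assumption of square tiles, the assumption of constant per-tile stage times, and the example timings are all hypothetical and do not come from the paper.

```python
import math

PAGE_BYTES = 4096  # 4 KB tile, the page-aligned tile size named in the abstract


def tile_side(elem_bytes: int, page_bytes: int = PAGE_BYTES) -> int:
    """Side of the largest square tile that fits in one page (assumes square tiles)."""
    return math.isqrt(page_bytes // elem_bytes)


def serial_time(n_tiles: int, t_in: float, t_compute: float, t_out: float) -> float:
    """Total time when each tile is loaded, computed, and stored with no overlap."""
    return n_tiles * (t_in + t_compute + t_out)


def pipelined_time(n_tiles: int, t_in: float, t_compute: float, t_out: float) -> float:
    """Total time for an ideal three-stage pipeline with constant stage times.

    The first tile pays the full latency of all three stages; every later
    tile adds only the duration of the slowest stage (the bottleneck).
    Assumes enough buffering that no stage stalls for lack of space.
    """
    if n_tiles == 0:
        return 0
    return (t_in + t_compute + t_out) + (n_tiles - 1) * max(t_in, t_compute, t_out)


if __name__ == "__main__":
    # Hypothetical per-tile stage times in arbitrary units, for illustration only.
    n, t_in, t_compute, t_out = 1000, 2.0, 3.0, 1.0
    s = serial_time(n, t_in, t_compute, t_out)
    p = pipelined_time(n, t_in, t_compute, t_out)
    print(f"int8 tile side: {tile_side(1)}, fp32 tile side: {tile_side(4)}")
    print(f"serial={s:.0f} pipelined={p:.0f} speedup={s / p:.2f}x")
```

Under this model, once the pipeline is full the steady-state cost per tile is set by the slowest stage alone. When the transfer stages dominate, throughput tracks interconnect bandwidth rather than MAC count, which is the regime the abstract describes.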
