The advent of Transformers has revolutionized computer vision, offering a powerful alternative to convolutional neural networks (CNNs), especially through local attention mechanisms that excel at capturing local structures within the input and achieve state-of-the-art performance. Processing-in-memory (PIM) architectures offer extensive parallelism, low data-movement costs, and scalable memory bandwidth, making them a promising solution for accelerating the memory-intensive operations of Transformers. However, the crucial challenge lies in efficiently deploying the entire model onto a resource-limited PIM system while parallelizing each Transformer block, which may contain many computational branches arising from local attention mechanisms. We present Allspark, which orchestrates the workload of visual Transformers on PIM systems with the aim of minimizing inference latency. First, to fully exploit the massive parallelism of PIM, Allspark employs a finer-grained partitioning scheme for computational branches and formulates a systematic layout and interleaved dataflow that maximize data locality and reduce data movement. Second, Allspark formulates the scheduling of the complete model on a resource-limited distributed PIM system as an integer linear programming (ILP) problem. Third, since local-global data interactions exhibit complex yet regular dependencies, Allspark provides a greedy mapping method that allocates computational branches onto the PIM system and minimizes NoC communication costs. Extensive experiments on 3D-stacked DRAM-based PIM systems show that Allspark achieves 1.2x-24.0x inference speedups for various visual Transformers over baselines, and that an Allspark-enriched PIM system yields average speedups of 2.3x and energy savings of 20x-55x over an Nvidia V100 GPU.