With the rapid advancement of generative models, efficiently deploying these models on specialized hardware has become critical. Tensor Processing Units (TPUs) are designed to accelerate AI workloads, but their high power consumption necessitates innovations for improving efficiency. Compute-in-memory (CIM) has emerged as a promising paradigm with superior area and energy efficiency. In this work, we present a TPU architecture that integrates digital CIM to replace the conventional digital systolic arrays in matrix multiply units (MXUs). We first establish a CIM-based TPU architecture model and simulator to evaluate the benefits of CIM for diverse generative model inference. Building upon the observed design insights, we further explore various CIM-based TPU architectural design choices. Compared to the baseline TPUv4i architecture, different design choices achieve up to 44.2% and 33.8% performance improvement for large language model and diffusion transformer inference, respectively, and up to a 27.3x reduction in MXU energy consumption.