With the rapid advancement of generative models, efficiently deploying these models on specialized hardware has become critical. Tensor Processing Units (TPUs) are designed to accelerate AI workloads, but their high power consumption necessitates innovations for improving efficiency. Compute-in-memory (CIM) has emerged as a promising paradigm with superior area and energy efficiency. In this work, we present a TPU architecture that integrates digital CIM to replace the conventional digital systolic arrays in matrix multiply units (MXUs). We first establish a CIM-based TPU architecture model and simulator to evaluate the benefits of CIM for diverse generative model inference. Building upon the observed design insights, we further explore various CIM-based TPU architectural design choices. Compared to the baseline TPUv4i architecture, different design choices achieve up to 44.2% and 33.8% performance improvement for large language model and diffusion transformer inference, respectively, and up to a 27.3x reduction in MXU energy consumption.