Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel

To overcome the well-known memory bottleneck of AI chips, 3D-stacked architectures that employ advanced packaging technology with high-density through-silicon via (TSV) pins have proven to be a promising solution. A 3D-stacked AI chip enables ultra-high memory bandwidth between compute and memory by stacking numerous DRAM banks atop many AI cores in a distributed manner. However, exploring the efficiency of a 3D-stacked AI chip is difficult because of this distributed nature: multiple intertwined factors must be considered together, ranging from the upper-level computing paradigm to machine learning (ML) compiler optimizations to the underlying hardware architecture. In this paper, we develop Voxel, a fast, compiler-aware, end-to-end simulation framework for exploring the efficiency of 3D-stacked AI chips for large language model (LLM) inference. Voxel enables software/hardware co-exploration through a programming interface that allows ML compilers to customize model execution plans. After validating the results of Voxel with an emulator on real silicon, we thoroughly examine the impact and correlation of different aspects of 3D-stacked AI chips, including state-of-the-art compute paradigms, tile-to-core mapping, tensor-to-bank mapping, NoC topologies and link bandwidth, DRAM bank bandwidth, per-core SRAM capacity, and energy/thermal constraints. Our findings show that the end-to-end efficiency of a 3D-stacked AI chip is determined not only by the interplay of these factors but also, to a significant degree, by the mapping of tiles to AI cores and DRAM banks. We report our findings throughout the paper, with the expectation that they will shed light on the development of the 3D-stacked AI chip ecosystem. We will open-source Voxel and our study results for public research.
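To give a rough intuition for why the mapping of tiles to cores and banks can matter, the toy sketch below counts on-chip network hops on a small 2D mesh under two bank placements. It is not Voxel's programming interface or cost model: the 4x4 mesh, the one-tile-per-core assignment, the assumption that each DRAM bank sits directly above one core, and the names `mesh_hops` and `total_hops` are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Toy model (assumption): cores sit on an N x N 2D mesh, and each DRAM bank
# is stacked directly above one core, so a bank is addressed by that core's
# (x, y) coordinate. None of this is taken from Voxel itself.
N = 4


def mesh_hops(core, bank):
    """Manhattan distance between a core and the core position under a bank."""
    return abs(core[0] - bank[0]) + abs(core[1] - bank[1])


def total_hops(tile_to_core, tile_to_bank):
    """Sum of hops every tile needs to reach the bank holding its tensor."""
    return sum(mesh_hops(tile_to_core[t], tile_to_bank[t]) for t in tile_to_core)


cores = list(product(range(N), range(N)))
tiles = range(len(cores))

# One tile per core (hypothetical tile-to-core mapping).
tile_to_core = {t: cores[t] for t in tiles}

# Mapping A: each tile's tensor lives in the bank right above its own core.
local_banks = {t: cores[t] for t in tiles}

# Mapping B: each tile's tensor lives in the bank mirrored across the mesh.
far_banks = {t: (N - 1 - cores[t][0], N - 1 - cores[t][1]) for t in tiles}

local = total_hops(tile_to_core, local_banks)
far = total_hops(tile_to_core, far_banks)
print(f"local mapping: {local} hops, mirrored mapping: {far} hops")
```

In this toy setting the local mapping needs no mesh traversal at all, while the mirrored mapping forces every tile across the network. A real evaluation would also have to account for the bandwidth, SRAM capacity, and energy/thermal factors listed in the abstract.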

