To overcome the well-known memory bottleneck of AI chips, 3D-stacked architectures that employ advanced packaging technology with high-density through-silicon vias (TSVs) have proven to be a promising solution. A 3D-stacked AI chip achieves ultra-high bandwidth between compute and memory by stacking numerous DRAM banks atop many AI cores in a distributed manner. However, the efficiency of a 3D-stacked AI chip is hard to explore because of this distributed nature: multiple intertwined factors must be considered together, ranging from the upper-level computing paradigm through machine learning (ML) compiler optimizations to the underlying hardware architecture. In this paper, we develop Voxel, a fast, compiler-aware, end-to-end simulation framework for exploring the efficiency of 3D-stacked AI chips for large language model (LLM) inference. Voxel enables software/hardware co-exploration through a programming interface that allows ML compilers to customize model execution plans. After validating the results of Voxel with an emulator on real silicon, we thoroughly examine the impact and correlation of different aspects of 3D-stacked AI chips, including state-of-the-art compute paradigms, tile-to-core mapping, tensor-to-bank mapping, network-on-chip (NoC) topologies and link bandwidth, DRAM bank bandwidth, per-core SRAM capacity, and energy/thermal constraints. Our findings show that the end-to-end efficiency of a 3D-stacked AI chip is determined not only by the interplay of these factors but also, to a significant degree, by the mappings from tiles to AI cores and DRAM banks. We report our findings throughout the paper, with the expectation that they will shed light on the development of the 3D-stacked AI chip ecosystem. We will open source Voxel and our study results for public research.