Advances in hybrid bonding and packaging have driven growing interest in 3D DRAM-stacked accelerators with higher memory bandwidth and capacity. As LLMs scale to hundreds of billions or trillions of parameters, distributed inference across multiple 3D chips becomes essential. With cross-stack co-design increasingly critical, we propose DeepStack, an accurate and efficient performance model and tool to enable early-stage system-hardware co-design space exploration (DSE) for distributed 3D-stacked AI systems. At the hardware level, DeepStack captures fine-grained 3D memory semantics such as transaction-aware bandwidth, bank activation constraints, and buffering limitations, as well as thermal and power behavior. At the system level, DeepStack incorporates comprehensive parallelization strategies and execution scheduling for distributed LLM inference. With novel modeling techniques such as dual-stage network abstraction and tile-level compute-communication overlap, we achieve up to 100,000x faster runtime than state-of-the-art simulators at comparable accuracy, cross-validated against our in-house 3D designs, an NS-3 network backend (2.12% error), and vLLM serving on 8x B200 GPUs (12.18% error). With hierarchical design space search, DeepStack enables efficient exploration over 2.5x10^14 design points spanning 3D-stacked DRAM layers, DRAM vertical connectivity, interconnect, compute-memory allocation, and distributed scheduling. Compared with baseline designs, DeepStack achieves up to 9.5x higher throughput through co-optimized parallelism and 3D architecture search. Our DSE further reveals that batch size drives a more fundamental architectural divide than the prefill/decode distinction, and that parallelism strategy and hardware architecture are tightly coupled: an incomplete schedule search leads to permanently suboptimal silicon that software tuning cannot recover. We intend to open source DeepStack to support future research.