Large language models (LLMs) are memory-intensive during decoding, which makes decoding a key bottleneck in LLM inference. To accelerate decoding, hybrid-bonding-based 3D-DRAM has been adopted in LLM accelerators. While this emerging technology provides strong performance gains over existing hardware, current 3D-DRAM accelerators (3D-Accelerators) rely on closed-source evaluation tools, leaving the community without publicly available performance analysis methods. Moreover, existing designs are highly customized for specific scenarios and lack a general, reusable full-stack model of 3D-Accelerators across diverse use cases. To bridge this fundamental gap, we present ATLAS, the first silicon-proven Architectural Three-dimensional-DRAM-based LLM Accelerator Simulation framework. Built on commercially deployed multi-layer 3D-DRAM technology, ATLAS introduces unified abstractions for both 3D-Accelerator system architecture and programming primitives to support arbitrary LLM inference scenarios. Validation against real silicon shows that ATLAS achieves $\le$8.57\% simulation error and 97.26--99.96\% correlation with measured performance. Through design space exploration with ATLAS, we demonstrate its ability to guide architecture design and distill key takeaways for both the 3D-DRAM memory system and the 3D-Accelerator microarchitecture across scenarios. ATLAS will be open-sourced upon publication, enabling further research on 3D-Accelerators.