DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators

Zhiwen Mo,Guoyu Li,Hao Mark Chen,Yu Cheng,Zhengju Tang,Qianzhou Wang,Lei Wang,Shuang Liang,Lingxiao Ma,Xianqi Zhou,Yuxiao Guo,Wayne Luk,Jilong Xue,Hongxiang Fan

from arxiv, fix typo

Advances in hybrid bonding and packaging have driven growing interest in 3D DRAM-stacked accelerators with higher memory bandwidth and capacity. As LLMs scale to hundreds of billions or trillions of parameters, distributed inference across multiple 3D chips becomes essential. With cross-stack co-design increasingly critical, we propose DeepStack, an accurate and efficient performance model and tool to enable early-stage system-hardware co-design space exploration (DSE) for distributed 3D-stacked AI systems. At the hardware level, DeepStack captures fine-grained 3D memory semantics such as transaction-aware bandwidth, bank activation constraints, buffering limitations, and thermal-power modeling. At the system level, DeepStack incorporates comprehensive parallelization strategies and execution scheduling for distributed LLM inference. With novel modeling techniques such as dual-stage network abstraction and tile-level compute-communication overlap, we achieve up to 100,000x faster runtime over state-of-the-art simulators at comparable accuracy, cross-validated against our in-house 3D designs, NS-3 backend (2.12%), and vLLM serving on 8xB200 GPUs (12.18%). With hierarchical design space search, DeepStack enables efficient exploration over 2.5x10^14 design points spanning 3D-stacked DRAM layers, DRAM vertical connectivity, interconnect, compute-memory allocation, and distributed scheduling. Compared with baseline designs, DeepStack achieves up to 9.5x higher throughput through co-optimized parallelism and 3D architecture search. Our DSE further reveals that batch size drives a more fundamental architectural divide than the prefill/decode distinction, and that parallelism strategy and hardware architecture are tightly coupled -- incomplete schedule search leads to permanently suboptimal silicon irrecoverable by software tuning. We intend to open source DeepStack to support future research.
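The abstract names tile-level compute-communication overlap but does not give its formulation. As a rough illustration only, and not DeepStack's actual model, the sketch below compares a fully serialized schedule with a simple two-stage pipeline in which the communication of one tile overlaps with the compute of the next. The function names, the uniform per-tile costs, and the example numbers are all assumptions made for this illustration.

```python
# Hypothetical two-stage pipeline estimate; not taken from the DeepStack paper.
# Assumes every tile has the same compute and communication cost.

def serialized_latency(num_tiles: int, compute_per_tile: float, comm_per_tile: float) -> float:
    """Latency when each tile computes, then communicates, with no overlap."""
    return num_tiles * (compute_per_tile + comm_per_tile)


def overlapped_latency(num_tiles: int, compute_per_tile: float, comm_per_tile: float) -> float:
    """Latency when tile i's communication overlaps tile i+1's compute.

    The first compute and the last communication cannot be hidden; in between,
    the slower of the two stages sets the pace of the pipeline.
    """
    if num_tiles <= 0:
        return 0.0
    steady_state = max(compute_per_tile, comm_per_tile)
    return compute_per_tile + (num_tiles - 1) * steady_state + comm_per_tile


if __name__ == "__main__":
    tiles, compute, comm = 8, 2.0, 1.0  # illustrative values, arbitrary time units
    print(serialized_latency(tiles, compute, comm))   # 24.0
    print(overlapped_latency(tiles, compute, comm))   # 17.0
```

In this toy setting the overlapped schedule is bound by the slower stage, so finer tiling helps only until per-tile overheads (ignored here) start to dominate.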
