DARKSIDE: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training

Heterogeneous · Deep Neural Networks · Chips · Edge · Flexibility

March 31, 2023

Angelo Garofalo, Yvan Tortorella, Matteo Perotti, Luca Valente, Alessandro Nadalini, Luca Benini, Davide Rossi, Francesco Conti

From arXiv; 11 pages, 15 figures.

On-chip DNN inference and training at the Extreme-Edge (TinyML) impose strict requirements on latency, throughput, accuracy, and flexibility. Heterogeneous clusters are a promising way to meet this challenge, combining the flexibility of DSP-enhanced cores with the performance and energy boost of dedicated accelerators. We present DARKSIDE, a System-on-Chip with a heterogeneous cluster of 8 RISC-V cores enhanced with 2-b to 32-b mixed-precision integer arithmetic. To boost performance and efficiency on key compute-intensive Deep Neural Network (DNN) kernels, the cluster is enriched with three digital accelerators: a specialized engine for low-data-reuse depthwise convolution kernels (up to 30 MAC/cycle); a minimal-overhead datamover to marshal 1-b to 32-b data on the fly; and a 16-b floating-point Tensor Product Engine (TPE) for tiled matrix-multiplication acceleration. DARKSIDE is implemented in 65nm CMOS technology. The cluster achieves a peak integer performance of 65 GOPS and a peak efficiency of 835 GOPS/W when working on 2-b integer DNN kernels. When targeting floating-point tensor operations, the TPE provides up to 18.2 GFLOPS of performance or 300 GFLOPS/W of efficiency, enough to enable on-chip floating-point training at competitive speed coupled with ultra-low-power quantized inference.
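
Packing several narrow integers into one 32-bit register lets a single SIMD dot-product instruction perform many multiply-accumulates at once, which is one common way sub-byte mixed-precision support on RISC-V cores is exploited. The sketch below is a minimal Python emulation under stated assumptions: signed two's-complement lanes, element 0 in the least significant bits, and a hypothetical `sdotp_word` helper standing in for a sum-of-dot-product instruction. The names `pack`, `unpack` and `sdotp_word` are illustrative only; they are not DARKSIDE's instruction set or software API.

```python
# Hypothetical emulation of packed sub-byte SIMD arithmetic.
# Lane layout and signedness are assumptions, not DARKSIDE's ISA.
WORD_BITS = 32  # register width of an (assumed) 32-bit RISC-V core


def pack(values, bits, signed=True):
    """Pack WORD_BITS // bits small integers into one word, element 0 in the LSBs."""
    lanes = WORD_BITS // bits
    if len(values) != lanes:
        raise ValueError(f"need exactly {lanes} values for {bits}-bit lanes")
    lo, hi = (-(1 << (bits - 1)), (1 << (bits - 1)) - 1) if signed else (0, (1 << bits) - 1)
    mask = (1 << bits) - 1
    word = 0
    for i, v in enumerate(values):
        if not lo <= v <= hi:
            raise ValueError(f"{v} does not fit in a {bits}-bit lane")
        word |= (v & mask) << (i * bits)  # v & mask yields the two's-complement bit pattern
    return word


def unpack(word, bits, signed=True):
    """Inverse of pack: extract every lane, sign-extending when signed."""
    mask = (1 << bits) - 1
    values = []
    for i in range(WORD_BITS // bits):
        v = (word >> (i * bits)) & mask
        if signed and v >> (bits - 1):  # top bit of the lane set -> negative value
            v -= 1 << bits
        values.append(v)
    return values


def sdotp_word(a_word, b_word, bits, acc=0):
    """Emulate a signed SIMD sum-of-dot-product: acc + sum(a[i] * b[i]) over all lanes."""
    a, b = unpack(a_word, bits), unpack(b_word, bits)
    return acc + sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    a = [1, -2, 0, 1] * 4  # sixteen signed 2-bit values (range -2..1)
    b = [-1, 1, 1, -2] * 4
    print(sdotp_word(pack(a, 2), pack(b, 2), bits=2))  # -20
```

With 2-bit lanes one call covers 16 multiply-accumulates, versus 4 with 8-bit lanes; that scaling is consistent with the abstract quoting its peak integer figures for 2-b kernels.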

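
The TPE accelerates tiled matrix multiplication on 16-bit floating-point data. A minimal sketch of what "tiled" means, in plain Python: the output is computed one tile at a time, and each reduction step multiplies one tile of A by one tile of B. Assumptions, none of which come from the abstract: inputs are rounded to IEEE 754 binary16 using the standard-library `struct` `'e'` format, partial sums are kept in double precision, results are rounded back to binary16, and the tile size is an arbitrary parameter. `to_fp16` and `tiled_matmul` are hypothetical names; this is not the TPE's datapath.

```python
import struct


def to_fp16(x):
    """Round a float to the nearest IEEE 754 binary16 value (assumed input format)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def tiled_matmul(A, B, tile=4):
    """Return A @ B for an MxK list-of-lists A and a KxN list-of-lists B.

    Hypothetical sketch: binary16 inputs, double-precision partial sums,
    binary16 outputs. The tile size is arbitrary, not the TPE's.
    """
    M, K, N = len(A), len(B), len(B[0])
    if any(len(row) != K for row in A):
        raise ValueError("inner dimensions do not match")
    A16 = [[to_fp16(v) for v in row] for row in A]
    B16 = [[to_fp16(v) for v in row] for row in B]
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):              # tile of output rows
        for j0 in range(0, N, tile):          # tile of output columns
            for k0 in range(0, K, tile):      # one A tile times one B tile
                for i in range(i0, min(i0 + tile, M)):
                    for j in range(j0, min(j0 + tile, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile, K)):
                            acc += A16[i][k] * B16[k][j]
                        C[i][j] = acc
    return [[to_fp16(v) for v in row] for row in C]


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(tiled_matmul(A, B, tile=2))  # [[58.0, 64.0], [139.0, 154.0]]
```

In general, tiling keeps a small working set (one tile of each operand plus the output tile) close to the compute units, which is what lets a fixed-size engine process matrices of arbitrary shape.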
