The rapid adoption of data-driven AI models, such as deep-learning inference, training, Vision Transformers (ViTs), and other HPC applications, drives a strong need for hardware support of runtime precision-configurable, diverse non-linear activation functions (AFs). Existing solutions support either diverse precision or runtime AF reconfigurability, but fail to address both simultaneously. This work proposes a flexible SIMD multiprecision processing element (Flex-PE) that supports diverse runtime-configurable AFs, including sigmoid, tanh, ReLU, and softmax, as well as MAC operations. The proposed design achieves improved throughput of up to 16X FxP4, 8X FxP8, 4X FxP16, and 1X FxP32 in pipeline mode with 100% time-multiplexed hardware. For edge-AI use cases, this work also proposes an area-efficient multiprecision iterative mode in SIMD systolic arrays. The design delivers superior performance, with up to 62X and 371X reductions in DMA reads for input feature maps and weight filters in VGG16, respectively, and an energy efficiency of 8.42 GOPS/W within an accuracy loss of 2%. The proposed architecture supports emerging 4-bit computations for DL inference while enhancing throughput in FxP8/16 modes for transformers and other HPC applications, enabling future energy-efficient AI accelerators in edge and cloud environments.
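The two ideas behind the PE can be illustrated in software: packing several narrow fixed-point lanes into one word so a single operation processes more low-precision elements per cycle, and dispatching a runtime-selected activation function over the same datapath. The sketch below is a behavioral model only, assuming 32-bit words and two's-complement lanes; function names (`split_lanes`, `simd_mac`, `apply_af`) are illustrative and not from the paper's RTL.

```python
import math

def split_lanes(word, bits, signed=True):
    """Split a 32-bit word into 32 // bits two's-complement fixed-point lanes.

    Behavioral stand-in for SIMD sub-word unpacking: FxP4 yields 8 lanes,
    FxP8 yields 4, FxP16 yields 2, FxP32 yields 1.
    """
    mask = (1 << bits) - 1
    lanes = []
    for i in range(32 // bits):
        v = (word >> (i * bits)) & mask
        if signed and v >= (1 << (bits - 1)):
            v -= 1 << bits  # sign-extend the lane
        lanes.append(v)
    return lanes

def simd_mac(acc, a_word, b_word, bits):
    """One multiprecision MAC step: multiply lanes pairwise and accumulate.

    Lower precision packs more lanes per word, modeling why throughput
    scales as the lane width shrinks.
    """
    a = split_lanes(a_word, bits)
    b = split_lanes(b_word, bits)
    return acc + sum(x * y for x, y in zip(a, b))

def apply_af(x, mode):
    """Runtime-selectable scalar activation (sigmoid / tanh / ReLU)."""
    if mode == "relu":
        return max(0.0, x)
    if mode == "sigmoid":
        return 1.0 / (1.0 + math.exp(-x))
    if mode == "tanh":
        return math.tanh(x)
    raise ValueError(f"unsupported AF mode: {mode}")

def softmax(xs):
    """Softmax over a vector (numerically stabilized by max subtraction)."""
    m = max(xs)
    e = [math.exp(v - m) for v in xs]
    s = sum(e)
    return [v / s for v in e]
```

For example, packing the FxP8 lanes [1, 2, 3, 4] and [5, 6, 7, 8] into two 32-bit words and calling `simd_mac` computes their dot product (70) in one modeled step, whereas FxP32 mode would need four.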