FP8 low-precision formats have gained significant adoption in Transformer inference and training. However, existing digital compute-in-memory (DCIM) architectures struggle to support variable FP8 aligned-mantissa bitwidths, because unified alignment strategies and fixed-precision multiply-accumulate (MAC) units cannot efficiently handle input data with diverse distributions. This work presents a flexible FP8 DCIM accelerator with three innovations: (1) a dynamic shift-aware bitwidth prediction (DSBP) scheme with on-the-fly input prediction that adaptively adjusts weight (2/4/6/8b) and input (2$\sim$12b) aligned-mantissa precision; (2) a FIFO-based input alignment unit (FIAU) that replaces complex barrel shifters with pointer-based control; and (3) a precision-scalable INT MAC array achieving flexible weight precision with minimal overhead. Implemented in 28nm CMOS with a 64$\times$96 CIM array, the design achieves 20.4 TFLOPS/W for fixed E5M7, a 2.8$\times$ improvement in FP8 efficiency over previous work, while supporting all FP8 formats. Results on Llama-7b show that DSBP achieves higher efficiency than the fixed-bitwidth mode at the same accuracy level on both the BoolQ and Winogrande datasets, with configurable parameters enabling flexible accuracy-efficiency trade-offs.
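To make the aligned-mantissa idea concrete, here is a minimal Python sketch (not the paper's hardware or RTL; the function names and the E4M3 decoding are illustrative assumptions) of how FP8 mantissas can be aligned to a shared exponent so that accumulation reduces to integer addition. The key point the abstract relies on is visible here: the aligned-mantissa bitwidth grows with the exponent spread of the group, so inputs with narrow distributions need far fewer bits than a worst-case fixed-precision MAC provides.

```python
def decode_e4m3(byte):
    """Decode an FP8 E4M3 byte into (sign, unbiased exponent, 4b mantissa with hidden bit).

    Illustrative decoder: bias = 7, 3 explicit mantissa bits; subnormals
    (exponent field 0) have no hidden bit and use exponent 1 - bias.
    """
    sign = -1 if (byte >> 7) & 1 else 1
    exp_field = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp_field == 0:                    # subnormal
        return sign, 1 - 7, man
    return sign, exp_field - 7, man | 0x8  # normal: prepend hidden leading 1

def align_mantissas(fp8_bytes):
    """Align a group of FP8 values to the group's minimum exponent.

    Each mantissa is shifted left by its exponent distance from e_min,
    so the exact sum is sum(aligned) * 2**(e_min - 3). The returned
    width is the magnitude bitwidth the integer accumulator must support.
    """
    decoded = [decode_e4m3(b) for b in fp8_bytes]
    e_min = min(e for _, e, _ in decoded)
    aligned = [s * (m << (e - e_min)) for s, e, m in decoded]
    width = max(abs(a).bit_length() for a in aligned)  # plus one sign bit
    return aligned, e_min, width
```

For example, aligning 1.0 (0x38) and 0.5 (0x30) needs only 5 magnitude bits, whereas a group with a wide exponent spread would need many more; a fixed-width datapath must always provision for the worst case, which is the inefficiency DSBP avoids.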