SGEMM-cube: Precision-Recovery FP32 GEMM Approximation on Ascend NPUs with FP16 Matrix Engines - 专知论文

会员服务 ·

0

近似 · 模型评估 · 值域 · Engineering · 缩放 ·

SGEMM-cube: Precision-Recovery FP32 GEMM Approximation on Ascend NPUs with FP16 Matrix Engines

翻译：暂无翻译

Weicheng Xue,Baisong Xu,Kai Yang,Yongxiang Liu,Dengdeng Fan,Pengxiang Xu,Yonghong Tian

Modern AI accelerators provide high-throughput low-precision matrix engines, but their support for FP32 GEMM is often limited or inefficient. This work presents SGEMM-cube, a precision-recovery FP32 GEMM approximation on Ascend NPUs using FP16 Cube units. Rather than claiming bit-exact FP32 approximation, SGEMM-cube targets near-FP32 accuracy for inputs whose magnitudes are representable within the FP16 dynamic range. The method follows a two-component FP32-to-FP16 splitting strategy related to Ozaki-style and Ootomo-style schemes: each FP32 operand is represented by an FP16 high component and a scaled FP16 residual component, and the matrix product is reconstructed from the dominant high-high and high-low terms while omitting the low-low term. The main contribution of this paper is not a new splitting paradigm, but an architecture-specific realization and analysis of this precision-recovery scheme on Ascend NPUs. We analyze the effects of round-to-nearest conversion, underflow, residual scaling, and accumulation order under the Ascend execution model, and clarify the range and accuracy limitations of the approach. We further adapt standard high-performance GEMM techniques, including L1-aware blocking and double-buffered pipelining, to the software-managed memory hierarchy of Ascend NPUs. Experiments on Ascend 910A show that SGEMM-cube recovers substantially higher accuracy than native FP16 GEMM and approaches FP32 SGEMM accuracy for moderate-range inputs, while achieving up to 65.3 TFLOP/s, corresponding to 77\% of the FP32-equivalent peak defined by the three-GEMM decomposition cost. These results demonstrate that FP32-accuracy GEMM approximation can be made practical on FP16-only NPU matrix engines, provided that its range, error, and implementation constraints are explicitly managed.

翻译：暂无翻译

0

相关内容

CVPR 2026 Findings | 算力砍半、性能不降！全开源 A₁模型：让机器人大模型真正走向落地

CVPR 2026 Findings | 算力砍半、性能不降！全开源 A₁模型：让机器人大模型真正走向落地

专知会员服务

15+阅读 · 4月12日

《用于适应性、任务就绪型军用仿生机器人的合成数据管道》

《用于适应性、任务就绪型军用仿生机器人的合成数据管道》

专知会员服务

21+阅读 · 2025年12月29日

《机器人弹性物体感知技术研究》227页

《机器人弹性物体感知技术研究》227页

专知会员服务

18+阅读 · 2025年11月20日

CVPR 2025 Highlight | OmniManip：以对象为中心的机器人通用操作框架

CVPR 2025 Highlight | OmniManip：以对象为中心的机器人通用操作框架

专知会员服务

9+阅读 · 2025年4月15日

国产大模型DeepSeek-V3一夜火爆全球，《DeepSeek-V3技术报告》，53页pdf

国产大模型DeepSeek-V3一夜火爆全球，《DeepSeek-V3技术报告》，53页pdf

专知会员服务

88+阅读 · 2024年12月27日

[ICML2022] NeuroFluid: 流体仿真的人工智能新范式

[ICML2022] NeuroFluid: 流体仿真的人工智能新范式

专知会员服务

27+阅读 · 2022年6月8日

Nat. Mach. Intel.| 自动因果推理在随机对照临床试验中的应用

Nat. Mach. Intel.| 自动因果推理在随机对照临床试验中的应用

专知会员服务

12+阅读 · 2022年6月3日

【CVPR 2022】基于元内存传输的跨域少镜头语义分割，Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer

【CVPR 2022】基于元内存传输的跨域少镜头语义分割，Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer

专知会员服务

13+阅读 · 2022年3月12日

处理器芯片敏捷设计方法：问题与挑战

专知会员服务

19+阅读 · 2021年6月29日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

【泡泡图灵智库】基于上采样预积分测量值的3D Lidar-IMU校准来矫正运动失真

【泡泡图灵智库】基于上采样预积分测量值的3D Lidar-IMU校准来矫正运动失真

泡泡机器人SLAM

11+阅读 · 2019年9月17日

初学者系列：Attentional Factorization Machines（AFM）详解

初学者系列：Attentional Factorization Machines（AFM）详解

专知

82+阅读 · 2019年9月16日

参数少一半，效果还更好，天津大学和微软提出Transformer压缩模型

参数少一半，效果还更好，天津大学和微软提出Transformer压缩模型

机器之心

15+阅读 · 2019年7月13日

MaskFusion: 多运动目标实时识别、跟踪和重建

MaskFusion: 多运动目标实时识别、跟踪和重建

计算机视觉life

11+阅读 · 2019年4月20日

【泡泡图灵智库】GCNv2：高效关联预测实时SLAM（arXiv）

【泡泡图灵智库】GCNv2：高效关联预测实时SLAM（arXiv）

泡泡机器人SLAM

45+阅读 · 2019年4月15日

【泡泡图灵智库】SGPN：用于3D点云实例分割的相似群建议网络（CVPR）

【泡泡图灵智库】SGPN：用于3D点云实例分割的相似群建议网络（CVPR）

泡泡机器人SLAM

15+阅读 · 2019年1月21日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【泡泡图灵智库】HSfM: 混合运动恢复结构（CVPR）

【泡泡图灵智库】HSfM: 混合运动恢复结构（CVPR）

泡泡机器人SLAM

11+阅读 · 2018年12月13日

From Softmax to Sparsemax-ICML16（1）

From Softmax to Sparsemax-ICML16（1）

KingsGarden

74+阅读 · 2016年11月26日

纳米尺度自旋电子器件参数化电路模型建立方法的研究

国家自然科学基金

0+阅读 · 2017年12月31日

高精度片上抖动测量关键技术及电路实现研究

国家自然科学基金

0+阅读 · 2015年12月31日

逆模型在线调整的两电机同步系统低损耗解耦控制

国家自然科学基金

0+阅读 · 2015年12月31日

大规模MIMO-OFDM系统中的同相/正交支路不平衡问题及其补偿方法研究

国家自然科学基金

1+阅读 · 2015年12月31日

吸收倍增分离型波导Si/Ge雪崩探测器件的研究

国家自然科学基金

0+阅读 · 2015年12月31日

超高速CMOS数模转换器关键技术研究

国家自然科学基金

0+阅读 · 2015年12月31日

可与MPSoC高度融合的片上自主测试-自主修复关键技术研究：针对自然、人为可靠性威胁

国家自然科学基金

0+阅读 · 2015年12月31日

微陀螺片上后台自校准与自补偿关键技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

40纳米工艺MOSFET器件毫米波建模和低功耗电路设计

国家自然科学基金

0+阅读 · 2014年12月31日

压电智能作动器的高保真完整非线性动力学建模和高精度多通道运动协同同步控制系统一体化优化设计

国家自然科学基金

0+阅读 · 2014年12月31日

Mixed-Precision Communication-Avoiding SGD for Generalized Linear Models on GPUs

Arxiv

0+阅读 · 6月16日

Non-Parametric Dual-Manifold Mapping via 8-Bit Bounded Transformation Matrices: Challenging FP-centric Hardware Paradigms in Low-Energy AI

Arxiv

0+阅读 · 6月11日

Graph Traversal on Tensor Cores: A BFS Framework for Modern GPUs

Arxiv

0+阅读 · 6月3日

FlexNPU: Transparent NPU Virtualization for Dynamic LLM Prefill-Decode Co-location

Arxiv

0+阅读 · 6月3日

Graph Attention-Based Virtual Metrology for Film Deposition Processes in Semiconductor Manufacturing

Arxiv

0+阅读 · 5月30日

Characterizing CPU-Induced Slowdowns in Multi-GPU LLM Inference

Arxiv

0+阅读 · 5月24日

Emerging memory technologies at room/cryogenic temperature

Arxiv

0+阅读 · 5月21日

Ascend-RaBitQ: Heterogeneous NPU-CPU Acceleration of Billion-Scale Similarity Search with 1-bit Quantization

Arxiv

0+阅读 · 5月15日

cuRegOT: A GPU-Accelerated Solver for Entropic-Regularized Optimal Transport

Arxiv

0+阅读 · 5月9日

TransDot: An Area-efficient Reconfigurable Floating-Point Unit for Trans-Precision Dot-Product Accumulation for FPGA AI Engines

Arxiv

0+阅读 · 5月8日

VIP会员

文章信息

相关主题

最新内容

ICML 2026 | 多任务贝叶斯上下文学习：让 Transformer 在测试时显式适应新先验

ICML 2026 | 多任务贝叶斯上下文学习：让 Transformer 在测试时显式适应新先验

专知会员服务

2+阅读 · 6月19日

ACL 2026综述 | 大规模手语数据集：资源、基准与标注标准

ACL 2026综述 | 大规模手语数据集：资源、基准与标注标准

专知会员服务

4+阅读 · 6月19日

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

专知会员服务

5+阅读 · 6月18日

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

专知会员服务

6+阅读 · 6月18日

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

专知会员服务

11+阅读 · 6月18日

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

专知会员服务

10+阅读 · 6月18日

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

专知会员服务

6+阅读 · 6月17日

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

专知会员服务

9+阅读 · 6月17日

学习数据的几何：形状空间分析数学综述

学习数据的几何：形状空间分析数学综述

专知会员服务

7+阅读 · 6月17日

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

专知会员服务

13+阅读 · 6月17日

定向能反无人机系统最新发展动态

定向能反无人机系统最新发展动态

专知会员服务

8+阅读 · 6月17日

从燃煤战舰到算法战争：水面指挥的永恒要求

从燃煤战舰到算法战争：水面指挥的永恒要求

专知会员服务

6+阅读 · 6月17日

《短程弹道再入飞行器拦截时间中的一项异常现象》

《短程弹道再入飞行器拦截时间中的一项异常现象》

专知会员服务

8+阅读 · 6月17日

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

专知会员服务

8+阅读 · 6月17日

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

专知会员服务

10+阅读 · 6月17日

相关VIP内容

CVPR 2026 Findings | 算力砍半、性能不降！全开源 A₁模型：让机器人大模型真正走向落地

CVPR 2026 Findings | 算力砍半、性能不降！全开源 A₁模型：让机器人大模型真正走向落地

专知会员服务

15+阅读 · 4月12日

《用于适应性、任务就绪型军用仿生机器人的合成数据管道》

《用于适应性、任务就绪型军用仿生机器人的合成数据管道》

专知会员服务

21+阅读 · 2025年12月29日

《机器人弹性物体感知技术研究》227页

《机器人弹性物体感知技术研究》227页

专知会员服务

18+阅读 · 2025年11月20日

CVPR 2025 Highlight | OmniManip：以对象为中心的机器人通用操作框架

CVPR 2025 Highlight | OmniManip：以对象为中心的机器人通用操作框架

专知会员服务

9+阅读 · 2025年4月15日

国产大模型DeepSeek-V3一夜火爆全球，《DeepSeek-V3技术报告》，53页pdf

国产大模型DeepSeek-V3一夜火爆全球，《DeepSeek-V3技术报告》，53页pdf

专知会员服务

88+阅读 · 2024年12月27日

[ICML2022] NeuroFluid: 流体仿真的人工智能新范式

[ICML2022] NeuroFluid: 流体仿真的人工智能新范式

专知会员服务

27+阅读 · 2022年6月8日

Nat. Mach. Intel.| 自动因果推理在随机对照临床试验中的应用

Nat. Mach. Intel.| 自动因果推理在随机对照临床试验中的应用

专知会员服务

12+阅读 · 2022年6月3日

【CVPR 2022】基于元内存传输的跨域少镜头语义分割，Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer

【CVPR 2022】基于元内存传输的跨域少镜头语义分割，Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer

专知会员服务

13+阅读 · 2022年3月12日

处理器芯片敏捷设计方法：问题与挑战

专知会员服务

19+阅读 · 2021年6月29日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

ACL 2026综述 | 大规模手语数据集：资源、基准与标注标准

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

ICML 2026 | 多任务贝叶斯上下文学习：让 Transformer 在测试时显式适应新先验

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

相关资讯

【泡泡图灵智库】基于上采样预积分测量值的3D Lidar-IMU校准来矫正运动失真

【泡泡图灵智库】基于上采样预积分测量值的3D Lidar-IMU校准来矫正运动失真

泡泡机器人SLAM

11+阅读 · 2019年9月17日

初学者系列：Attentional Factorization Machines（AFM）详解

初学者系列：Attentional Factorization Machines（AFM）详解

专知

82+阅读 · 2019年9月16日

参数少一半，效果还更好，天津大学和微软提出Transformer压缩模型

参数少一半，效果还更好，天津大学和微软提出Transformer压缩模型

机器之心

15+阅读 · 2019年7月13日

MaskFusion: 多运动目标实时识别、跟踪和重建

MaskFusion: 多运动目标实时识别、跟踪和重建

计算机视觉life

11+阅读 · 2019年4月20日

【泡泡图灵智库】GCNv2：高效关联预测实时SLAM（arXiv）

【泡泡图灵智库】GCNv2：高效关联预测实时SLAM（arXiv）

泡泡机器人SLAM

45+阅读 · 2019年4月15日

【泡泡图灵智库】SGPN：用于3D点云实例分割的相似群建议网络（CVPR）

【泡泡图灵智库】SGPN：用于3D点云实例分割的相似群建议网络（CVPR）

泡泡机器人SLAM

15+阅读 · 2019年1月21日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【泡泡图灵智库】HSfM: 混合运动恢复结构（CVPR）

【泡泡图灵智库】HSfM: 混合运动恢复结构（CVPR）

泡泡机器人SLAM

11+阅读 · 2018年12月13日

From Softmax to Sparsemax-ICML16（1）

From Softmax to Sparsemax-ICML16（1）

KingsGarden

74+阅读 · 2016年11月26日

相关论文

Mixed-Precision Communication-Avoiding SGD for Generalized Linear Models on GPUs

Arxiv

0+阅读 · 6月16日

Non-Parametric Dual-Manifold Mapping via 8-Bit Bounded Transformation Matrices: Challenging FP-centric Hardware Paradigms in Low-Energy AI

Arxiv

0+阅读 · 6月11日

Graph Traversal on Tensor Cores: A BFS Framework for Modern GPUs

Arxiv

0+阅读 · 6月3日

FlexNPU: Transparent NPU Virtualization for Dynamic LLM Prefill-Decode Co-location

Arxiv

0+阅读 · 6月3日

Graph Attention-Based Virtual Metrology for Film Deposition Processes in Semiconductor Manufacturing

Arxiv

0+阅读 · 5月30日

Characterizing CPU-Induced Slowdowns in Multi-GPU LLM Inference

Arxiv

0+阅读 · 5月24日

Emerging memory technologies at room/cryogenic temperature

Arxiv

0+阅读 · 5月21日

Ascend-RaBitQ: Heterogeneous NPU-CPU Acceleration of Billion-Scale Similarity Search with 1-bit Quantization

Arxiv

0+阅读 · 5月15日

cuRegOT: A GPU-Accelerated Solver for Entropic-Regularized Optimal Transport

Arxiv

0+阅读 · 5月9日

TransDot: An Area-efficient Reconfigurable Floating-Point Unit for Trans-Precision Dot-Product Accumulation for FPGA AI Engines

Arxiv

0+阅读 · 5月8日

相关基金

纳米尺度自旋电子器件参数化电路模型建立方法的研究

国家自然科学基金

0+阅读 · 2017年12月31日

高精度片上抖动测量关键技术及电路实现研究

国家自然科学基金

0+阅读 · 2015年12月31日

逆模型在线调整的两电机同步系统低损耗解耦控制

国家自然科学基金

0+阅读 · 2015年12月31日

大规模MIMO-OFDM系统中的同相/正交支路不平衡问题及其补偿方法研究

国家自然科学基金

1+阅读 · 2015年12月31日

吸收倍增分离型波导Si/Ge雪崩探测器件的研究

国家自然科学基金

0+阅读 · 2015年12月31日

超高速CMOS数模转换器关键技术研究

国家自然科学基金

0+阅读 · 2015年12月31日

可与MPSoC高度融合的片上自主测试-自主修复关键技术研究：针对自然、人为可靠性威胁

国家自然科学基金

0+阅读 · 2015年12月31日

微陀螺片上后台自校准与自补偿关键技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

40纳米工艺MOSFET器件毫米波建模和低功耗电路设计

国家自然科学基金

0+阅读 · 2014年12月31日

压电智能作动器的高保真完整非线性动力学建模和高精度多通道运动协同同步控制系统一体化优化设计

国家自然科学基金

0+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员