Randomized Sketching is Robust to Low-Precision Rounding on GPUs - 专知论文

会员服务 ·

0

GPUs · 模型评估 · 稳健性 · 稀疏 · GPU ·

Randomized Sketching is Robust to Low-Precision Rounding on GPUs

翻译：暂无翻译

Aryaman Jeendgar,Clément Flint,Hartwig Anzt

from arxiv, 14 pages, 3 figures

Randomized sketching is a core primitive in randomized numerical linear algebra. On modern hardware architectures, in particular on GPUs, the performance of sparse sketches is limited by memory traffic and atomic accumulation rather than floating-point throughput. This makes sketching a natural target for mixed precision, provided that low-precision accumulation does not degrade the embedding quality. We study mixed-precision GPU implementations of sparse oblivious subspace embeddings, focusing on a SparseStack generalization of the GPU CountSketch kernel of Higgins et al. SparseStack improves embedding quality relative to CountSketch on coherent inputs, but its additional nonzeros per column increase atomic-update contention and reduce throughput. We therefore implement FP16 SparseStack variants using deterministic round-to-nearest, exact stochastic rounding, and dithered rounding, and compare them with FP32 SparseStack, CountSketch, mixed-precision CountSketch, and FlashSketch. Our main empirical finding is that, for the tested regimes, SparseStack embedding quality is insensitive to the FP16 rounding rule. Deterministic, stochastic, and dithered rounding FP16 SparseStack produce nearly identical subspace distortion and sketch-and-solve least-squares accuracy across incoherent, coherent, and adversarial test problems. The dominant accuracy factor is the sketch distribution rather than the quantization rule: SparseStack variants substantially improve distortion on coherent inputs, while all methods behave similarly on incoherent inputs. Since deterministic rounding has the lowest overhead, it provides the best performance--accuracy tradeoff among the FP16 SparseStack variants.

翻译：暂无翻译

0

相关内容

GPUs

《在军用边缘设备上部署生成式人工智能：攻克性能、安全与效能挑战》最新30页slides

《在军用边缘设备上部署生成式人工智能：攻克性能、安全与效能挑战》最新30页slides

专知会员服务

34+阅读 · 2025年11月19日

KDD25 | 大语言模型能否提高图神经网络的对抗鲁棒性？

KDD25 | 大语言模型能否提高图神经网络的对抗鲁棒性？

专知会员服务

19+阅读 · 2024年12月18日

面向多GPU的图神经网络训练加速

面向多GPU的图神经网络训练加速

专知会员服务

24+阅读 · 2023年1月19日

南洋理工北大等首篇《GPU数据中心中深度学习工作负载调度》综述论文，35页pdf全面阐述DL训练与推理GPU调度技术进展

南洋理工北大等首篇《GPU数据中心中深度学习工作负载调度》综述论文，35页pdf全面阐述DL训练与推理GPU调度技术进展

专知会员服务

46+阅读 · 2022年5月27日

「AI芯片/GPU/NPU/DSP专用处理器」技术特征比较分析最新2022综述论文

「AI芯片/GPU/NPU/DSP专用处理器」技术特征比较分析最新2022综述论文

专知会员服务

65+阅读 · 2022年3月6日

中科院计算所牵头发布《专⽤数据处理器DPU技术白皮书》，94页pdf

中科院计算所牵头发布《专⽤数据处理器DPU技术白皮书》，94页pdf

专知会员服务

91+阅读 · 2021年10月24日

处理器芯片敏捷设计方法：问题与挑战

专知会员服务

19+阅读 · 2021年6月29日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

精通ChatGPT等大模型，掌握最前沿技术，这有份绝佳资源

精通ChatGPT等大模型，掌握最前沿技术，这有份绝佳资源

机器之心

15+阅读 · 2023年4月12日

GAN新书《生成式深度学习》Generative Deep Learning，附379页全文PDF

GAN新书《生成式深度学习》Generative Deep Learning，附379页全文PDF

专知

96+阅读 · 2019年9月30日

微软亚研提出VL-BERT：通用的视觉-语言预训练模型

微软亚研提出VL-BERT：通用的视觉-语言预训练模型

机器之心

15+阅读 · 2019年9月3日

参数少一半，效果还更好，天津大学和微软提出Transformer压缩模型

参数少一半，效果还更好，天津大学和微软提出Transformer压缩模型

机器之心

15+阅读 · 2019年7月13日

ICLR 2019计算机视觉、NLP、图模型、对抗学习、表示学习和元学习最新技术分享

ICLR 2019计算机视觉、NLP、图模型、对抗学习、表示学习和元学习最新技术分享

深度学习与NLP

17+阅读 · 2019年6月16日

【学界】DeepMind论文：深度压缩感知，新框架提升GAN性能

【学界】DeepMind论文：深度压缩感知，新框架提升GAN性能

GAN生成式对抗网络

14+阅读 · 2019年5月23日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

用 LDA 和 LSA 两种方法来降维和做 Topic 建模

用 LDA 和 LSA 两种方法来降维和做 Topic 建模

AI研习社

13+阅读 · 2018年8月24日

From Softmax to Sparsemax-ICML16（1）

From Softmax to Sparsemax-ICML16（1）

KingsGarden

74+阅读 · 2016年11月26日

三维堆叠DRAM的低功耗刷新技术研究

国家自然科学基金

0+阅读 · 2015年12月31日

面向数万处理器的有限元线性方程组与模态多级算法研究

国家自然科学基金

0+阅读 · 2015年12月31日

大型薄壁零件在机测量数据高效提取方法与关键技术研究

国家自然科学基金

1+阅读 · 2015年12月31日

嵌入式异构多核系统应用程序自动并行化过程关键技术研究

国家自然科学基金

1+阅读 · 2015年12月31日

可与MPSoC高度融合的片上自主测试-自主修复关键技术研究：针对自然、人为可靠性威胁

国家自然科学基金

0+阅读 · 2015年12月31日

面向存储受限应用的GPU性能预测模型和通信优化关键技术研究

国家自然科学基金

2+阅读 · 2015年12月31日

低功耗数字化高集成度无线通信SoC芯片关键技术研究

国家自然科学基金

2+阅读 · 2014年12月31日

三维片上网络芯片关键设计技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

三维连续集成集成电路关键工艺技术和机理研究

国家自然科学基金

0+阅读 · 2014年12月31日

40纳米工艺MOSFET器件毫米波建模和低功耗电路设计

国家自然科学基金

0+阅读 · 2014年12月31日

Skills for the future software profession: beyond agentic AI!

Arxiv

0+阅读 · 6月23日

Learning Expressive Random Feature Models via Parametrized Activations

Arxiv

0+阅读 · 6月19日

Learning-Based Modeling of Soft Robots via Cosserat Rod Theory

Arxiv

0+阅读 · 6月18日

VideoSketcher: Sequential Sketch Generation Using Video Model Priors

Arxiv

0+阅读 · 6月18日

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

Arxiv

0+阅读 · 6月18日

Robust and Interpretable Adaptation of Equivariant Materials Foundation Models via Sparsity-promoting Fine-tuning

Arxiv

0+阅读 · 6月17日

A skew polynomial framework for constructing division algebras and linear maximum rank distance codes

Arxiv

0+阅读 · 6月16日

From GPU to Microcontroller: Online Ridge Regression for Edge-Deployable Traffic Prediction

Arxiv

0+阅读 · 6月16日

A Differentiable GPU-Accelerated Finite Element Framework for Inverse Characterization of Finite-Strain Anisotropic Plasticity

Arxiv

0+阅读 · 6月16日

Meta-Learning with Implicit Gradients

Meta-Learning with Implicit Gradients

Arxiv

13+阅读 · 2019年9月10日

VIP会员

文章信息

相关主题

最新内容

无人机自主控制与人工智能：系统性综述

无人机自主控制与人工智能：系统性综述

专知会员服务

10+阅读 · 今天7:25

巡飞弹与反无人机系统——现代战场的两大支柱

巡飞弹与反无人机系统——现代战场的两大支柱

专知会员服务

3+阅读 · 今天6:54

《打造“黄金舰队”》57页报告

《打造“黄金舰队”》57页报告

专知会员服务

3+阅读 · 今天6:52

《北约数字教官网络发展路径》128页报告

《北约数字教官网络发展路径》128页报告

专知会员服务

2+阅读 · 今天6:33

ECCV 2026 | MIMFlow：MIM与归一化流统一图像生成

ECCV 2026 | MIMFlow：MIM与归一化流统一图像生成

专知会员服务

7+阅读 · 6月25日

超越自回归边界：扩散模型、世界模型与SSM如何重塑代码智能

超越自回归边界：扩散模型、世界模型与SSM如何重塑代码智能

专知会员服务

6+阅读 · 6月25日

重塑决策优势：美军作战艺术与多域作战中联盟联合全域指挥控制（CJADC2）体系的融合

重塑决策优势：美军作战艺术与多域作战中联盟联合全域指挥控制（CJADC2）体系的融合

专知会员服务

9+阅读 · 6月25日

网状网络及其在军事领域的运用

网状网络及其在军事领域的运用

专知会员服务

7+阅读 · 6月25日

《意识即战场——全球安全体系中认知战的演进：乌克兰构建认知作战体系的展望》

《意识即战场——全球安全体系中认知战的演进：乌克兰构建认知作战体系的展望》

专知会员服务

8+阅读 · 6月25日

无美国参与的欧洲战争方式（万字长文）

无美国参与的欧洲战争方式（万字长文）

专知会员服务

8+阅读 · 6月25日

重构“下一场战争”的制胜理论：超越兰彻斯特方程与现代系统

重构“下一场战争”的制胜理论：超越兰彻斯特方程与现代系统

专知会员服务

10+阅读 · 6月25日

《国防工业中基于模型定义的实施：产品定义数字化转型的战略路径》90页

《国防工业中基于模型定义的实施：产品定义数字化转型的战略路径》90页

专知会员服务

9+阅读 · 6月25日

《国防领域敏感性分析白皮书》

《国防领域敏感性分析白皮书》

专知会员服务

9+阅读 · 6月25日

综述 | 从问答到任务完成：Agent系统与Harness设计

综述 | 从问答到任务完成：Agent系统与Harness设计

专知会员服务

10+阅读 · 6月24日

Agentic RL：框架、实践与长程智能体训练

Agentic RL：框架、实践与长程智能体训练

专知会员服务

10+阅读 · 6月24日

相关VIP内容

《在军用边缘设备上部署生成式人工智能：攻克性能、安全与效能挑战》最新30页slides

《在军用边缘设备上部署生成式人工智能：攻克性能、安全与效能挑战》最新30页slides

专知会员服务

34+阅读 · 2025年11月19日

KDD25 | 大语言模型能否提高图神经网络的对抗鲁棒性？

KDD25 | 大语言模型能否提高图神经网络的对抗鲁棒性？

专知会员服务

19+阅读 · 2024年12月18日

面向多GPU的图神经网络训练加速

面向多GPU的图神经网络训练加速

专知会员服务

24+阅读 · 2023年1月19日

南洋理工北大等首篇《GPU数据中心中深度学习工作负载调度》综述论文，35页pdf全面阐述DL训练与推理GPU调度技术进展

南洋理工北大等首篇《GPU数据中心中深度学习工作负载调度》综述论文，35页pdf全面阐述DL训练与推理GPU调度技术进展

专知会员服务

46+阅读 · 2022年5月27日

「AI芯片/GPU/NPU/DSP专用处理器」技术特征比较分析最新2022综述论文

「AI芯片/GPU/NPU/DSP专用处理器」技术特征比较分析最新2022综述论文

专知会员服务

65+阅读 · 2022年3月6日

中科院计算所牵头发布《专⽤数据处理器DPU技术白皮书》，94页pdf

中科院计算所牵头发布《专⽤数据处理器DPU技术白皮书》，94页pdf

专知会员服务

91+阅读 · 2021年10月24日

处理器芯片敏捷设计方法：问题与挑战

专知会员服务

19+阅读 · 2021年6月29日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

巡飞弹与反无人机系统——现代战场的两大支柱

《北约数字教官网络发展路径》128页报告

无人机自主控制与人工智能：系统性综述

《打造“黄金舰队”》57页报告

相关资讯

精通ChatGPT等大模型，掌握最前沿技术，这有份绝佳资源

精通ChatGPT等大模型，掌握最前沿技术，这有份绝佳资源

机器之心

15+阅读 · 2023年4月12日

GAN新书《生成式深度学习》Generative Deep Learning，附379页全文PDF

GAN新书《生成式深度学习》Generative Deep Learning，附379页全文PDF

专知

96+阅读 · 2019年9月30日

微软亚研提出VL-BERT：通用的视觉-语言预训练模型

微软亚研提出VL-BERT：通用的视觉-语言预训练模型

机器之心

15+阅读 · 2019年9月3日

参数少一半，效果还更好，天津大学和微软提出Transformer压缩模型

参数少一半，效果还更好，天津大学和微软提出Transformer压缩模型

机器之心

15+阅读 · 2019年7月13日

ICLR 2019计算机视觉、NLP、图模型、对抗学习、表示学习和元学习最新技术分享

ICLR 2019计算机视觉、NLP、图模型、对抗学习、表示学习和元学习最新技术分享

深度学习与NLP

17+阅读 · 2019年6月16日

【学界】DeepMind论文：深度压缩感知，新框架提升GAN性能

【学界】DeepMind论文：深度压缩感知，新框架提升GAN性能

GAN生成式对抗网络

14+阅读 · 2019年5月23日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

用 LDA 和 LSA 两种方法来降维和做 Topic 建模

用 LDA 和 LSA 两种方法来降维和做 Topic 建模

AI研习社

13+阅读 · 2018年8月24日

From Softmax to Sparsemax-ICML16（1）

From Softmax to Sparsemax-ICML16（1）

KingsGarden

74+阅读 · 2016年11月26日

相关论文

Skills for the future software profession: beyond agentic AI!

Arxiv

0+阅读 · 6月23日

Learning Expressive Random Feature Models via Parametrized Activations

Arxiv

0+阅读 · 6月19日

Learning-Based Modeling of Soft Robots via Cosserat Rod Theory

Arxiv

0+阅读 · 6月18日

VideoSketcher: Sequential Sketch Generation Using Video Model Priors

Arxiv

0+阅读 · 6月18日

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

Arxiv

0+阅读 · 6月18日

Robust and Interpretable Adaptation of Equivariant Materials Foundation Models via Sparsity-promoting Fine-tuning

Arxiv

0+阅读 · 6月17日

A skew polynomial framework for constructing division algebras and linear maximum rank distance codes

Arxiv

0+阅读 · 6月16日

From GPU to Microcontroller: Online Ridge Regression for Edge-Deployable Traffic Prediction

Arxiv

0+阅读 · 6月16日

A Differentiable GPU-Accelerated Finite Element Framework for Inverse Characterization of Finite-Strain Anisotropic Plasticity

Arxiv

0+阅读 · 6月16日

Meta-Learning with Implicit Gradients

Meta-Learning with Implicit Gradients

Arxiv

13+阅读 · 2019年9月10日

相关基金

三维堆叠DRAM的低功耗刷新技术研究

国家自然科学基金

0+阅读 · 2015年12月31日

面向数万处理器的有限元线性方程组与模态多级算法研究

国家自然科学基金

0+阅读 · 2015年12月31日

大型薄壁零件在机测量数据高效提取方法与关键技术研究

国家自然科学基金

1+阅读 · 2015年12月31日

嵌入式异构多核系统应用程序自动并行化过程关键技术研究

国家自然科学基金

1+阅读 · 2015年12月31日

可与MPSoC高度融合的片上自主测试-自主修复关键技术研究：针对自然、人为可靠性威胁

国家自然科学基金

0+阅读 · 2015年12月31日

面向存储受限应用的GPU性能预测模型和通信优化关键技术研究

国家自然科学基金

2+阅读 · 2015年12月31日

低功耗数字化高集成度无线通信SoC芯片关键技术研究

国家自然科学基金

2+阅读 · 2014年12月31日

三维片上网络芯片关键设计技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

三维连续集成集成电路关键工艺技术和机理研究

国家自然科学基金

0+阅读 · 2014年12月31日

40纳米工艺MOSFET器件毫米波建模和低功耗电路设计

国家自然科学基金

0+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员