PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications - 专知论文

会员服务 ·

0

核化 · GPU · 时间步 · MoDELS · 环 ·

2023 年 5 月 12 日

PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications

翻译：PERKS：面向迭代内存受限GPU应用的局部性优化执行模型

Lingqi Zhang,Mohamed Wahib,Peng Chen,Jintao Meng,Xiao Wang,Toshio Endo,Satoshi Matsuoka

from arxiv, This paper will be published in 2023 International Conference on Supercomputing (ICS23)

Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as much as time/algorithm steps there are. The termination of each kernel implicitly acts the barrier required after advancing the solution every time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this model, the time loop is moved inside persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching subset of the output in each time step in the unused registers and shared memory. PERKS can be generalized to any iterative solver: they largely independent of the solver's implementation. We explain the design principle of PERKS and demonstrate effectiveness of PERKS for a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of $2.12$x for 2D stencils and $1.24$x for 3D stencils over state-of-art libraries), and a Krylov subspace conjugate gradient solver (geomean speedup of $4.86$x in smaller SpMV datasets from SuiteSparse and $1.43$x in larger SpMV datasets over a state-of-art library). All PERKS-based implementations available at: https://github.com/neozhang307/PERKS.

翻译：迭代内存受限求解器常见于高性能计算代码中。典型的GPU实现方式是在主机端维护循环，每次时间步或算法步骤调用GPU内核。每个内核的终止隐式充当推进每个时间步求解后所需的屏障。我们提出了一种用于运行内存受限迭代GPU内核的执行模型：持久化内核（PERsistent KernelS，PERKS）。在此模型中，时间循环被移入持久化内核内部，并采用设备级屏障进行同步。随后，我们通过在每个时间步中将输出子集缓存至未使用的寄存器和共享内存中，减少了对设备内存的访问流量。PERKS可推广至任意迭代求解器：其设计在很大程度上独立于求解器的具体实现。我们阐述了PERKS的设计原理，并通过一系列二维/三维迭代模板基准测试（与现有最优库相比，二维模板几何平均加速比达2.12倍，三维模板达1.24倍）及Krylov子空间共轭梯度求解器（与现有最优库相比，SuiteSparse较小SpMV数据集的几何平均加速比达4.86倍，较大SpMV数据集达1.43倍）验证了其效果。所有基于PERKS的实现代码均可在https://github.com/neozhang307/PERKS 获取。

0

相关内容

自然语言处理顶会NAACL2022最佳论文出炉！

自然语言处理顶会NAACL2022最佳论文出炉！

专知会员服务

43+阅读 · 2022年6月30日

50+篇《神经架构搜索NAS》2020论文合集

专知会员服务

61+阅读 · 2020年3月19日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

164+阅读 · 2019年10月12日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

强化学习三篇论文避免遗忘等

强化学习三篇论文避免遗忘等

CreateAMind

20+阅读 · 2019年5月24日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

【推荐】用Python/OpenCV实现增强现实

【推荐】用Python/OpenCV实现增强现实

机器学习研究会

15+阅读 · 2017年11月16日

【推荐】YOLO实时目标检测(6fps)

【推荐】YOLO实时目标检测(6fps)

机器学习研究会

20+阅读 · 2017年11月5日

MoCoGAN 分解运动和内容的视频生成

MoCoGAN 分解运动和内容的视频生成

CreateAMind

18+阅读 · 2017年10月21日

【推荐】RNN/LSTM时序预测

【推荐】RNN/LSTM时序预测

机器学习研究会

25+阅读 · 2017年9月8日

【推荐】GAN架构入门综述(资源汇总)

【推荐】GAN架构入门综述(资源汇总)

机器学习研究会

10+阅读 · 2017年9月3日

围产期奶牛能量负平衡引发胰岛素抵抗的分子机制

国家自然科学基金

0+阅读 · 2014年12月31日

全空间中临界Surface Quasi-geostrophic方程的全局吸引子及其分形维数

国家自然科学基金

0+阅读 · 2014年12月31日

LOC283683-NIPA1-BMPRII途径对胆固醇平衡和动脉粥样硬化的影响及机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

ATP激活血管内皮细胞P2Y2受体趋化巡逻型单核细胞稳定动脉粥样硬化斑块

国家自然科学基金

0+阅读 · 2013年12月31日

基于NIC的Exascale级计算机聚合通信卸载关键技术研究

国家自然科学基金

0+阅读 · 2012年12月31日

lncRNA-UCA1通过PKM2参与膀胱癌细胞Warburg效应的机制

国家自然科学基金

0+阅读 · 2012年12月31日

基于PTP1B研究树豆酮酸A对2型糖尿病胰岛素与瘦素抵抗的作用

国家自然科学基金

0+阅读 · 2012年12月31日

基于ELAD和RNN的电动车用电动机运行效率快速优化关键技术研究

国家自然科学基金

0+阅读 · 2012年12月31日

纳米限域体系中杂多酸催化酮类Baeyer-Villiger氧化反应研究

国家自然科学基金

0+阅读 · 2011年12月31日

非定常流场自适应鲁棒降阶模型研究

国家自然科学基金

0+阅读 · 2009年12月31日

Localized implicit time stepping for the wave equation

Arxiv

0+阅读 · 2023年6月29日

An improved kernelization algorithm for Trivially Perfect Editing

Arxiv

0+阅读 · 2023年6月29日

DeciLS-PBO: an Effective Local Search Method for Pseudo-Boolean Optimization

Arxiv

0+阅读 · 2023年6月29日

1M parameters are enough? A lightweight CNN-based model for medical image segmentation

Arxiv

0+阅读 · 2023年6月28日

Generalizing to new geometries with Geometry-Aware Autoregressive Models (GAAMs) for fast calorimeter simulation

Arxiv

0+阅读 · 2023年6月28日

Balanced Encoding of Near-Zero Correlation for an AES Implementation

Arxiv

0+阅读 · 2023年6月27日

Subspace Recycling for Sequences of Shifted Systems with Applications in Image Recovery

Arxiv

0+阅读 · 2023年6月26日

Statistical Component Separation for Targeted Signal Recovery in Noisy Mixtures

Arxiv

0+阅读 · 2023年6月26日

ExSum: From Local Explanations to Model Understanding

Arxiv

13+阅读 · 2022年4月30日

On Neural Differential Equations

Arxiv

24+阅读 · 2022年2月4日

VIP会员

文章信息

相关主题

最新内容

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

专知会员服务

1+阅读 · 今天2:06

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

专知会员服务

1+阅读 · 今天1:37

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

专知会员服务

2+阅读 · 6月17日

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

专知会员服务

2+阅读 · 6月17日

学习数据的几何：形状空间分析数学综述

学习数据的几何：形状空间分析数学综述

专知会员服务

3+阅读 · 6月17日

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

专知会员服务

6+阅读 · 6月17日

定向能反无人机系统最新发展动态

定向能反无人机系统最新发展动态

专知会员服务

6+阅读 · 6月17日

从燃煤战舰到算法战争：水面指挥的永恒要求

从燃煤战舰到算法战争：水面指挥的永恒要求

专知会员服务

3+阅读 · 6月17日

《短程弹道再入飞行器拦截时间中的一项异常现象》

《短程弹道再入飞行器拦截时间中的一项异常现象》

专知会员服务

4+阅读 · 6月17日

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

《基于回归方法与任务上下文的对抗环境动态战术网络报文优先级排序》

专知会员服务

4+阅读 · 6月17日

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

美智库《战术级指挥控制的迫切要求：构建弹性机动式指挥控制网络》报告

专知会员服务

4+阅读 · 6月17日

《韩国国防政策与军备出口：韩国安全与国防政策如何塑造其国防工业与军备出口格局》最新100页报告

《韩国国防政策与军备出口：韩国安全与国防政策如何塑造其国防工业与军备出口格局》最新100页报告

专知会员服务

3+阅读 · 6月17日

ICML 2026 | VOTP：用视频基础模型与最优传输，让离线偏好强化学习只需少量反馈

ICML 2026 | VOTP：用视频基础模型与最优传输，让离线偏好强化学习只需少量反馈

专知会员服务

5+阅读 · 6月16日

多模态代码智能综述：从视觉输入到可执行代码系统

多模态代码智能综述：从视觉输入到可执行代码系统

专知会员服务

7+阅读 · 6月16日

美国马六甲“三重网”概念：安全网、威慑网与杀伤网

美国马六甲“三重网”概念：安全网、威慑网与杀伤网

专知会员服务

6+阅读 · 6月16日

相关VIP内容

自然语言处理顶会NAACL2022最佳论文出炉！

自然语言处理顶会NAACL2022最佳论文出炉！

专知会员服务

43+阅读 · 2022年6月30日

50+篇《神经架构搜索NAS》2020论文合集

专知会员服务

61+阅读 · 2020年3月19日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

164+阅读 · 2019年10月12日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

相关资讯

强化学习三篇论文避免遗忘等

强化学习三篇论文避免遗忘等

CreateAMind

20+阅读 · 2019年5月24日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

【推荐】用Python/OpenCV实现增强现实

【推荐】用Python/OpenCV实现增强现实

机器学习研究会

15+阅读 · 2017年11月16日

【推荐】YOLO实时目标检测(6fps)

【推荐】YOLO实时目标检测(6fps)

机器学习研究会

20+阅读 · 2017年11月5日

MoCoGAN 分解运动和内容的视频生成

MoCoGAN 分解运动和内容的视频生成

CreateAMind

18+阅读 · 2017年10月21日

【推荐】RNN/LSTM时序预测

【推荐】RNN/LSTM时序预测

机器学习研究会

25+阅读 · 2017年9月8日

【推荐】GAN架构入门综述(资源汇总)

【推荐】GAN架构入门综述(资源汇总)

机器学习研究会

10+阅读 · 2017年9月3日

相关论文

Localized implicit time stepping for the wave equation

Arxiv

0+阅读 · 2023年6月29日

An improved kernelization algorithm for Trivially Perfect Editing

Arxiv

0+阅读 · 2023年6月29日

DeciLS-PBO: an Effective Local Search Method for Pseudo-Boolean Optimization

Arxiv

0+阅读 · 2023年6月29日

1M parameters are enough? A lightweight CNN-based model for medical image segmentation

Arxiv

0+阅读 · 2023年6月28日

Generalizing to new geometries with Geometry-Aware Autoregressive Models (GAAMs) for fast calorimeter simulation

Arxiv

0+阅读 · 2023年6月28日

Balanced Encoding of Near-Zero Correlation for an AES Implementation

Arxiv

0+阅读 · 2023年6月27日

Subspace Recycling for Sequences of Shifted Systems with Applications in Image Recovery

Arxiv

0+阅读 · 2023年6月26日

Statistical Component Separation for Targeted Signal Recovery in Noisy Mixtures

Arxiv

0+阅读 · 2023年6月26日

ExSum: From Local Explanations to Model Understanding

Arxiv

13+阅读 · 2022年4月30日

On Neural Differential Equations

Arxiv

24+阅读 · 2022年2月4日

相关基金

围产期奶牛能量负平衡引发胰岛素抵抗的分子机制

国家自然科学基金

0+阅读 · 2014年12月31日

全空间中临界Surface Quasi-geostrophic方程的全局吸引子及其分形维数

国家自然科学基金

0+阅读 · 2014年12月31日

LOC283683-NIPA1-BMPRII途径对胆固醇平衡和动脉粥样硬化的影响及机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

ATP激活血管内皮细胞P2Y2受体趋化巡逻型单核细胞稳定动脉粥样硬化斑块

国家自然科学基金

0+阅读 · 2013年12月31日

基于NIC的Exascale级计算机聚合通信卸载关键技术研究

国家自然科学基金

0+阅读 · 2012年12月31日

lncRNA-UCA1通过PKM2参与膀胱癌细胞Warburg效应的机制

国家自然科学基金

0+阅读 · 2012年12月31日

基于PTP1B研究树豆酮酸A对2型糖尿病胰岛素与瘦素抵抗的作用

国家自然科学基金

0+阅读 · 2012年12月31日

基于ELAD和RNN的电动车用电动机运行效率快速优化关键技术研究

国家自然科学基金

0+阅读 · 2012年12月31日

纳米限域体系中杂多酸催化酮类Baeyer-Villiger氧化反应研究

国家自然科学基金

0+阅读 · 2011年12月31日

非定常流场自适应鲁棒降阶模型研究

国家自然科学基金

0+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员