Portability and Scalability of OpenMP Offloading on State-of-the-art Accelerators



April 9, 2023


Yehonatan Fridman, Guy Tamir, Gal Oren

From arXiv, 13 pages

Over the last decade, most of the increase in computing power has come from advances in accelerated many-core architectures, mainly in the form of GPGPUs. While accelerators achieve phenomenal performance in various computing tasks, their utilization requires code adaptations and transformations. Thus, OpenMP, the most common standard for multi-threading in scientific computing applications, has offered offloading capabilities between hosts (CPUs) and accelerators since v4.0, with increasing support in the successive v4.5, v5.0, v5.1, and the latest v5.2 versions. Recently, two state-of-the-art GPUs - the Intel Ponte Vecchio Max 1100 and the NVIDIA A100 - were released to the market, with oneAPI and GNU LLVM-backed compilation for offloading, respectively. In this work, we present early performance results of OpenMP offloading capabilities to these devices while specifically analyzing the portability of advanced directives (using SOLLVE's OMPVV test suite) and the scalability of the hardware on a representative scientific mini-app (the LULESH benchmark). Our results show that the vast majority of the offloading directives in v4.5 and v5.0 are supported in the latest oneAPI and GNU compilers; however, support for v5.1 and v5.2 is still lacking. From the performance perspective, we found that PVC (Ponte Vecchio) is up to 37% better than the A100 on the LULESH benchmark, presenting better performance in computing and data movements.
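For readers unfamiliar with the offloading directives the abstract refers to, the following is a minimal sketch in C of an OpenMP 4.5-style offloaded loop. It is not taken from the paper, OMPVV, or LULESH; the function name saxpy_offload and its parameters are hypothetical and chosen only for illustration. If no device is available, or OpenMP is not enabled at compile time, the loop simply runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the paper): computes y[i] = a * x[i] + y[i].
 * The target construct asks the OpenMP runtime to run the loop on an
 * accelerator when one is available; otherwise it executes on the host. */
void saxpy_offload(int n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    /* map(to: ...) copies x to the device before the region runs;
     * map(tofrom: ...) copies y to the device and back to the host afterwards. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The sketch can be built with OpenMP enabled (for example, -fopenmp on GCC or Clang). The additional flags that select an offload target differ between toolchains and are deliberately not shown here.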

