Evaluating performance and portability of high-level programming models: Julia, Python/Numba, and Kokkos on exascale nodes

March 10, 2023

William F. Godoy, Pedro Valero-Lara, T. Elise Dettling, Christian Trefftz, Ian Jorquera, Thomas Sheehy, Ross G. Miller, Marc Gonzalez-Tallada, Jeffrey S. Vetter, Valentin Churavy

From arXiv. Accepted at the 28th HIPS workshop, held in conjunction with IPDPS 2023. 10 pages, 9 figures.

We explore the performance and portability of three high-level programming models, the LLVM-based Julia and Python/Numba, and Kokkos, on high-performance computing (HPC) nodes: AMD Epyc CPUs and MI250X graphics processing units (GPUs) on Crusher, Frontier's test bed system, and Ampere's Arm-based CPUs and NVIDIA's A100 GPUs on the Wombat system at the Oak Ridge Leadership Computing Facility. We compare the default performance of a hand-rolled dense matrix multiplication algorithm on CPUs against vendor-compiled C/OpenMP implementations, and on each GPU against CUDA and HIP. Rather than focusing on kernel optimization per se, we select this naive approach because it resembles exploratory work in science and provides a lower bound for performance that isolates the effect of each programming model. Julia and Kokkos perform comparably with C/OpenMP on CPUs, while Julia implementations are competitive with CUDA and HIP on GPUs. Performance gaps are identified on NVIDIA A100 GPUs for Julia's single precision and Kokkos, and for Python/Numba in all scenarios. We also comment on half-precision support, productivity, performance portability metrics, and platform readiness. We expect to contribute to the understanding and direction of high-level, high-productivity languages in HPC as the first-generation exascale systems are deployed.
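To make the "hand-rolled dense matrix multiplication" concrete, the following is a minimal sketch of a naive triple-loop kernel in plain Python. It is not the authors' benchmark code: the function name `naive_matmul`, the list-of-rows layout, and the loop order are assumptions chosen for illustration only.

```python
def naive_matmul(a, b):
    """Multiply two dense matrices stored as lists of rows.

    Hypothetical illustration: a plain triple loop with no blocking,
    tiling, or vectorization, i.e. the kind of unoptimized kernel the
    abstract describes as a lower bound for performance.
    """
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions do not match")
    n = len(b[0])

    # Output matrix C (m x n), initialized to zero.
    c = [[0.0] * n for _ in range(m)]

    for i in range(m):          # rows of A
        for j in range(n):      # columns of B
            acc = 0.0
            for l in range(k):  # shared inner dimension
                acc += a[i][l] * b[l][j]
            c[i][j] = acc
    return c
```

For example, `naive_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19.0, 22.0], [43.0, 50.0]]`. In a Python/Numba setting, a loop nest of this shape would typically operate on NumPy arrays and be JIT-compiled; whether the paper's kernels follow this exact structure is not stated in the abstract.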

