GTaP: A GPU-Resident Fork-Join Task-Parallel Runtime with a Pragma-Based Interface - 专知论文

会员服务 ·

0

GPUs · 控制器 · 中央处理器 (CPU) · MoDELS · Processing（编程语言） ·

GTaP: A GPU-Resident Fork-Join Task-Parallel Runtime with a Pragma-Based Interface

翻译：暂无翻译

Yuki Maeda,Kenjiro Taura

from arxiv, 12 pages, 11 figures

Graphics Processing Units (GPUs) excel at regular data-parallel workloads where massive hardware parallelism can be readily exploited. In contrast, many important irregular applications are naturally expressed as task parallelism with a fork-join control structure. While CPU runtimes for fork-join task parallelism are mature, it remains challenging to efficiently support it on GPUs. We propose GTaP, a GPU-resident runtime that supports fork-join task parallelism. GTaP is based on the persistent kernel model, and supports two worker granularities: thread blocks and individual threads. To realize fork-join on GPUs, GTaP represents joins as continuations and executes each task as a state machine that can be split into multiple execution segments. We also extend Clang's frontend with a pragma-based programming model that enables programmers to express fork-join without exposing low-level mechanisms. GTaP employs work stealing for load balancing, providing better scalability than a global-queue approach. For thread-level workers, we further introduce Execution-Path-Aware Queueing (EPAQ), which allows programmers to partition task queues using user-defined criteria, reducing warp divergence caused by mixing heterogeneous control flows within a warp. Across representative irregular applications, GTaP outperforms OpenMP task-parallel execution on a 72-core CPU in many cases, especially for large problem sizes with compute-intensive tasks. We also show that GTaP's design choices outperform naive GPU alternatives. The benefit of EPAQ is workload-dependent: it can improve performance for some benchmarks while having little effect on others; on Fibonacci, EPAQ achieves up to a 1.8$\times$ speedup.

翻译：暂无翻译

0

相关内容

GPUs

【阿姆斯特丹博士论文】GPU图算法性能分析与预测，227页pdf

【阿姆斯特丹博士论文】GPU图算法性能分析与预测，227页pdf

专知会员服务

40+阅读 · 2023年4月10日

【ChatGPT系列报告】ChatGPT的“背后英雄”，100页报告看懂GPU

【ChatGPT系列报告】ChatGPT的“背后英雄”，100页报告看懂GPU

专知会员服务

122+阅读 · 2023年2月18日

面向多GPU的图神经网络训练加速

面向多GPU的图神经网络训练加速

专知会员服务

24+阅读 · 2023年1月19日

南洋理工北大等首篇《GPU数据中心中深度学习工作负载调度》综述论文，35页pdf全面阐述DL训练与推理GPU调度技术进展

南洋理工北大等首篇《GPU数据中心中深度学习工作负载调度》综述论文，35页pdf全面阐述DL训练与推理GPU调度技术进展

专知会员服务

46+阅读 · 2022年5月27日

【图机器学习论文】图神经网络的逻辑表达性（Logical Expressiveness of Graph Neural Networks）

【图机器学习论文】图神经网络的逻辑表达性（Logical Expressiveness of Graph Neural Networks）

专知会员服务

41+阅读 · 2019年12月30日

【图神经网络综述】图神经网络的方法和应用，清华大学周杰

【图神经网络综述】图神经网络的方法和应用，清华大学周杰

专知会员服务

474+阅读 · 2019年12月15日

【O'Reilly TensorFlow Conference 2019】HARP：高效的GPU共享系统（HARP: An efficient and elastic GPU-sharing system），Alibaba | Pengfei Fan，Lingling Jin

【O'Reilly TensorFlow Conference 2019】HARP：高效的GPU共享系统（HARP: An efficient and elastic GPU-sharing system），Alibaba | Pengfei Fan，Lingling Jin

专知会员服务

10+阅读 · 2019年11月13日

【O'Reilly AI Conference 2019】使用GPU和Docker容器进行Horovod和Spark深度学习（Deep learning with Horovod and Spark using GPUs and Docker containers），BlueData的联合创始人兼首席架构师Thomas Phelan

【O'Reilly AI Conference 2019】使用GPU和Docker容器进行Horovod和Spark深度学习（Deep learning with Horovod and Spark using GPUs and Docker containers），BlueData的联合创始人兼首席架构师Thomas Phelan

专知会员服务

21+阅读 · 2019年11月5日

Deep Learning for Graphs: Models and Applications，密歇根州立大学唐继良助理教授，CIPS ATT 16（2019）

Deep Learning for Graphs: Models and Applications，密歇根州立大学唐继良助理教授，CIPS ATT 16（2019）

专知会员服务

54+阅读 · 2019年10月25日

【CVPR 2019 | tutorial】通过图结构网络学习表示Learning Representations via Graph-structured Networks，圣地亚哥大学|Xiaolong Wang，英伟达|Sifei Liu

【CVPR 2019 | tutorial】通过图结构网络学习表示Learning Representations via Graph-structured Networks，圣地亚哥大学|Xiaolong Wang，英伟达|Sifei Liu

专知会员服务

19+阅读 · 2019年6月16日

盘点来自工业界的GPU共享方案

盘点来自工业界的GPU共享方案

计算机视觉life

12+阅读 · 2021年9月2日

图神经网络模型集合GraphGallery，TensorFLow&PyTorch一并实现

图神经网络模型集合GraphGallery，TensorFLow&PyTorch一并实现

专知

20+阅读 · 2020年10月5日

中文版！学习TensorFlow、PyTorch、机器学习、深度学习四件套！（附免费下载）

中文版！学习TensorFlow、PyTorch、机器学习、深度学习四件套！（附免费下载）

AINLP

29+阅读 · 2020年8月9日

【Github】nlp-tutorial：TensorFlow 和 PyTorch 实现各种NLP模型

【Github】nlp-tutorial：TensorFlow 和 PyTorch 实现各种NLP模型

AINLP

14+阅读 · 2019年9月4日

Colab 免费提供 Tesla T4 GPU，是时候薅羊毛了

Colab 免费提供 Tesla T4 GPU，是时候薅羊毛了

机器之心

10+阅读 · 2019年4月25日

Github项目推荐 | 图神经网络(GNN)相关资源大列表

Github项目推荐 | 图神经网络(GNN)相关资源大列表

AI研习社

58+阅读 · 2019年4月1日

LeCun推荐：最新PyTorch图神经网络库，速度快15倍（GitHub+论文）

LeCun推荐：最新PyTorch图神经网络库，速度快15倍（GitHub+论文）

未来产业促进会

18+阅读 · 2019年3月10日

PyTorch实现多种深度强化学习算法

PyTorch实现多种深度强化学习算法

专知

36+阅读 · 2019年1月15日

Github 项目推荐 | Nvidia 用于数据增强和 JPEG 图像解码的 GPU 加速库 DALI

Github 项目推荐 | Nvidia 用于数据增强和 JPEG 图像解码的 GPU 加速库 DALI

AI研习社

11+阅读 · 2018年6月27日

深度学习的GPU：深度学习中使用GPU的经验和建议

深度学习的GPU：深度学习中使用GPU的经验和建议

数据挖掘入门与实战

11+阅读 · 2018年1月3日

天元数学交流项目图像处理中的数学理论及方法研讨会

国家自然科学基金

9+阅读 · 2017年12月31日

基于GPU的几类分数阶微分方程的并行算法研究及其实现

国家自然科学基金

0+阅读 · 2015年12月31日

面向数万处理器的有限元线性方程组与模态多级算法研究

国家自然科学基金

0+阅读 · 2015年12月31日

使用GPU加速银道面尘埃辐射图像的高分辨率模拟与多参数反演

国家自然科学基金

0+阅读 · 2015年12月31日

神经形态多核处理器的架构模型研究

国家自然科学基金

3+阅读 · 2015年12月31日

大视场综合孔径成像优化的研究

国家自然科学基金

1+阅读 · 2015年12月31日

面向存储受限应用的GPU性能预测模型和通信优化关键技术研究

国家自然科学基金

2+阅读 · 2015年12月31日

面向大数据的高时效并行计算机系统结构与技术

国家自然科学基金

0+阅读 · 2014年12月31日

CPU和GPU混合体系结构上生物网络比对并行算法研究

国家自然科学基金

0+阅读 · 2014年12月31日

面向类脑计算存储器的调制编码理论及方法研究

国家自然科学基金

1+阅读 · 2014年12月31日

Revealing NVIDIA Closed-Source Driver Command Streams for CPU-GPU Runtime Behavior Insight

Arxiv

0+阅读 · 4月29日

GICC: A High-Performance Runtime for GPU-Initiated Communication and Coordination in Modern HPC Systems

Arxiv

0+阅读 · 4月24日

A Stackelberg Game Framework with Drainability Guardrails for Pricing and Scaling in Multi-Tenant GPU Cloud Platforms

Arxiv

0+阅读 · 4月18日

Blueprint, Bootstrap, and Bridge: A Security Look at NVIDIA GPU Confidential Computing

Arxiv

0+阅读 · 4月17日

Taming GPU Underutilization via Static Partitioning and Fine-grained CPU Offloading

Arxiv

0+阅读 · 4月9日

A Case For Host Code Guided GPU Data Race Detector

Arxiv

0+阅读 · 4月2日

From Skew to Symmetry: Node-Interconnect Multi-Path Balancing with Execution-time Planning for Modern GPU Clusters

Arxiv

0+阅读 · 3月31日

PRISM: Dynamic Primitive-Based Forecasting for Large-Scale GPU Cluster Workloads

Arxiv

0+阅读 · 3月26日

AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Arxiv

0+阅读 · 3月22日

GPU-Accelerated Genetic Programming for Symbolic Regression with Beagle Framework

Arxiv

0+阅读 · 3月10日

VIP会员

文章信息

相关主题

中央处理器 (CPU)

Processing（编程语言）

最新内容

DeepSeek 版Claude Code，免费小白安装教程来了！

DeepSeek 版Claude Code，免费小白安装教程来了！

专知会员服务

6+阅读 · 5月5日

【ICML Spotlight 2026】 T²PO: 不确定性引导的探索控制框架，实现稳定多轮Agentic强化学习

【ICML Spotlight 2026】 T²PO: 不确定性引导的探索控制框架，实现稳定多轮Agentic强化学习

专知会员服务

2+阅读 · 5月5日

基础模型驱动的工业智能体：技术成熟度、能力变迁与未竟之挑战

基础模型驱动的工业智能体：技术成熟度、能力变迁与未竟之挑战

专知会员服务

2+阅读 · 5月5日

《机动炮兵的演进与未来：技术进步、历史沿革与炮兵作战前瞻》

《机动炮兵的演进与未来：技术进步、历史沿革与炮兵作战前瞻》

专知会员服务

3+阅读 · 5月5日

《火炮弹药快速效能建模：提升互操作性与技术优势》（报告）

《火炮弹药快速效能建模：提升互操作性与技术优势》（报告）

专知会员服务

4+阅读 · 5月5日

《美空军条令出版物 2-0：情报（2026版）》

《美空军条令出版物 2-0：情报（2026版）》

专知会员服务

11+阅读 · 5月5日

美陆军“飞蝇陷阱5.0”项目将新兴技术交到作战人员手中

美陆军“飞蝇陷阱5.0”项目将新兴技术交到作战人员手中

专知会员服务

3+阅读 · 5月5日

帕兰提尔 Gotham：一个游戏规则改变器

帕兰提尔 Gotham：一个游戏规则改变器

专知会员服务

5+阅读 · 5月5日

【ICML 2026】用测试时训练线性化视觉Transformer：T⁵ 实现 Softmax 注意力到线性复杂度的快速转换

【ICML 2026】用测试时训练线性化视觉Transformer：T⁵ 实现 Softmax 注意力到线性复杂度的快速转换

专知会员服务

2+阅读 · 5月5日

【AAAI 2026】大模型做知识蒸馏：CMM将LLM特征拆解给小模型协同学习

【AAAI 2026】大模型做知识蒸馏：CMM将LLM特征拆解给小模型协同学习

专知会员服务

2+阅读 · 5月5日

【ICML Spotlight 2026 】NonZero：交互引导探索的多智能体蒙特卡洛树搜索

【ICML Spotlight 2026 】NonZero：交互引导探索的多智能体蒙特卡洛树搜索

专知会员服务

8+阅读 · 5月4日

【综述】机器人学习中的世界模型：全面综述

【综述】机器人学习中的世界模型：全面综述

专知会员服务

11+阅读 · 5月4日

伊朗的导弹-无人机行动及其对美国威慑的影响

伊朗的导弹-无人机行动及其对美国威慑的影响

专知会员服务

8+阅读 · 5月4日

《未来战术无人机系统案例研究：量身定制采办策略方法》100页报告

《未来战术无人机系统案例研究：量身定制采办策略方法》100页报告

专知会员服务

8+阅读 · 5月4日

战争贩子：2026年第一季度美国对中东潜在军售激增

战争贩子：2026年第一季度美国对中东潜在军售激增

专知会员服务

6+阅读 · 5月4日

相关VIP内容

【阿姆斯特丹博士论文】GPU图算法性能分析与预测，227页pdf

【阿姆斯特丹博士论文】GPU图算法性能分析与预测，227页pdf

专知会员服务

40+阅读 · 2023年4月10日

【ChatGPT系列报告】ChatGPT的“背后英雄”，100页报告看懂GPU

【ChatGPT系列报告】ChatGPT的“背后英雄”，100页报告看懂GPU

专知会员服务

122+阅读 · 2023年2月18日

面向多GPU的图神经网络训练加速

面向多GPU的图神经网络训练加速

专知会员服务

24+阅读 · 2023年1月19日

南洋理工北大等首篇《GPU数据中心中深度学习工作负载调度》综述论文，35页pdf全面阐述DL训练与推理GPU调度技术进展

南洋理工北大等首篇《GPU数据中心中深度学习工作负载调度》综述论文，35页pdf全面阐述DL训练与推理GPU调度技术进展

专知会员服务

46+阅读 · 2022年5月27日

【图机器学习论文】图神经网络的逻辑表达性（Logical Expressiveness of Graph Neural Networks）

【图机器学习论文】图神经网络的逻辑表达性（Logical Expressiveness of Graph Neural Networks）

专知会员服务

41+阅读 · 2019年12月30日

【图神经网络综述】图神经网络的方法和应用，清华大学周杰

【图神经网络综述】图神经网络的方法和应用，清华大学周杰

专知会员服务

474+阅读 · 2019年12月15日

【O'Reilly TensorFlow Conference 2019】HARP：高效的GPU共享系统（HARP: An efficient and elastic GPU-sharing system），Alibaba | Pengfei Fan，Lingling Jin

【O'Reilly TensorFlow Conference 2019】HARP：高效的GPU共享系统（HARP: An efficient and elastic GPU-sharing system），Alibaba | Pengfei Fan，Lingling Jin

专知会员服务

10+阅读 · 2019年11月13日

【O'Reilly AI Conference 2019】使用GPU和Docker容器进行Horovod和Spark深度学习（Deep learning with Horovod and Spark using GPUs and Docker containers），BlueData的联合创始人兼首席架构师Thomas Phelan

【O'Reilly AI Conference 2019】使用GPU和Docker容器进行Horovod和Spark深度学习（Deep learning with Horovod and Spark using GPUs and Docker containers），BlueData的联合创始人兼首席架构师Thomas Phelan

专知会员服务

21+阅读 · 2019年11月5日

Deep Learning for Graphs: Models and Applications，密歇根州立大学唐继良助理教授，CIPS ATT 16（2019）

Deep Learning for Graphs: Models and Applications，密歇根州立大学唐继良助理教授，CIPS ATT 16（2019）

专知会员服务

54+阅读 · 2019年10月25日

【CVPR 2019 | tutorial】通过图结构网络学习表示Learning Representations via Graph-structured Networks，圣地亚哥大学|Xiaolong Wang，英伟达|Sifei Liu

【CVPR 2019 | tutorial】通过图结构网络学习表示Learning Representations via Graph-structured Networks，圣地亚哥大学|Xiaolong Wang，英伟达|Sifei Liu

专知会员服务

19+阅读 · 2019年6月16日

热门VIP内容

开通专知VIP会员享更多权益服务

【ICML Spotlight 2026】 T²PO: 不确定性引导的探索控制框架，实现稳定多轮Agentic强化学习

《机动炮兵的演进与未来：技术进步、历史沿革与炮兵作战前瞻》

DeepSeek 版Claude Code，免费小白安装教程来了！

基础模型驱动的工业智能体：技术成熟度、能力变迁与未竟之挑战

相关资讯

盘点来自工业界的GPU共享方案

盘点来自工业界的GPU共享方案

计算机视觉life

12+阅读 · 2021年9月2日

图神经网络模型集合GraphGallery，TensorFLow&PyTorch一并实现

图神经网络模型集合GraphGallery，TensorFLow&PyTorch一并实现

专知

20+阅读 · 2020年10月5日

中文版！学习TensorFlow、PyTorch、机器学习、深度学习四件套！（附免费下载）

中文版！学习TensorFlow、PyTorch、机器学习、深度学习四件套！（附免费下载）

AINLP

29+阅读 · 2020年8月9日

【Github】nlp-tutorial：TensorFlow 和 PyTorch 实现各种NLP模型

【Github】nlp-tutorial：TensorFlow 和 PyTorch 实现各种NLP模型

AINLP

14+阅读 · 2019年9月4日

Colab 免费提供 Tesla T4 GPU，是时候薅羊毛了

Colab 免费提供 Tesla T4 GPU，是时候薅羊毛了

机器之心

10+阅读 · 2019年4月25日

Github项目推荐 | 图神经网络(GNN)相关资源大列表

Github项目推荐 | 图神经网络(GNN)相关资源大列表

AI研习社

58+阅读 · 2019年4月1日

LeCun推荐：最新PyTorch图神经网络库，速度快15倍（GitHub+论文）

LeCun推荐：最新PyTorch图神经网络库，速度快15倍（GitHub+论文）

未来产业促进会

18+阅读 · 2019年3月10日

PyTorch实现多种深度强化学习算法

PyTorch实现多种深度强化学习算法

专知

36+阅读 · 2019年1月15日

Github 项目推荐 | Nvidia 用于数据增强和 JPEG 图像解码的 GPU 加速库 DALI

Github 项目推荐 | Nvidia 用于数据增强和 JPEG 图像解码的 GPU 加速库 DALI

AI研习社

11+阅读 · 2018年6月27日

深度学习的GPU：深度学习中使用GPU的经验和建议

深度学习的GPU：深度学习中使用GPU的经验和建议

数据挖掘入门与实战

11+阅读 · 2018年1月3日

相关论文

Revealing NVIDIA Closed-Source Driver Command Streams for CPU-GPU Runtime Behavior Insight

Arxiv

0+阅读 · 4月29日

GICC: A High-Performance Runtime for GPU-Initiated Communication and Coordination in Modern HPC Systems

Arxiv

0+阅读 · 4月24日

A Stackelberg Game Framework with Drainability Guardrails for Pricing and Scaling in Multi-Tenant GPU Cloud Platforms

Arxiv

0+阅读 · 4月18日

Blueprint, Bootstrap, and Bridge: A Security Look at NVIDIA GPU Confidential Computing

Arxiv

0+阅读 · 4月17日

Taming GPU Underutilization via Static Partitioning and Fine-grained CPU Offloading

Arxiv

0+阅读 · 4月9日

A Case For Host Code Guided GPU Data Race Detector

Arxiv

0+阅读 · 4月2日

From Skew to Symmetry: Node-Interconnect Multi-Path Balancing with Execution-time Planning for Modern GPU Clusters

Arxiv

0+阅读 · 3月31日

PRISM: Dynamic Primitive-Based Forecasting for Large-Scale GPU Cluster Workloads

Arxiv

0+阅读 · 3月26日

AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Arxiv

0+阅读 · 3月22日

GPU-Accelerated Genetic Programming for Symbolic Regression with Beagle Framework

Arxiv

0+阅读 · 3月10日

相关基金

天元数学交流项目图像处理中的数学理论及方法研讨会

国家自然科学基金

9+阅读 · 2017年12月31日

基于GPU的几类分数阶微分方程的并行算法研究及其实现

国家自然科学基金

0+阅读 · 2015年12月31日

面向数万处理器的有限元线性方程组与模态多级算法研究

国家自然科学基金

0+阅读 · 2015年12月31日

使用GPU加速银道面尘埃辐射图像的高分辨率模拟与多参数反演

国家自然科学基金

0+阅读 · 2015年12月31日

神经形态多核处理器的架构模型研究

国家自然科学基金

3+阅读 · 2015年12月31日

大视场综合孔径成像优化的研究

国家自然科学基金

1+阅读 · 2015年12月31日

面向存储受限应用的GPU性能预测模型和通信优化关键技术研究

国家自然科学基金

2+阅读 · 2015年12月31日

面向大数据的高时效并行计算机系统结构与技术

国家自然科学基金

0+阅读 · 2014年12月31日

CPU和GPU混合体系结构上生物网络比对并行算法研究

国家自然科学基金

0+阅读 · 2014年12月31日

面向类脑计算存储器的调制编码理论及方法研究

国家自然科学基金

1+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员