FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow - 专知论文

会员服务 ·

0

FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow

翻译：暂无翻译

Sina Heidari,Dimitrios S. Nikolopoulos

Deep learning compilers and vendor libraries deliver strong baseline performance but are bounded by finite, engineer-curated catalogs. When these omit needed optimizations, practitioners substitute hand-written CUDA or CUTLASS, demanding expertise in GPU microarchitecture and C++ template metaprogramming. Recent LLM-based agents target kernel generation in raw CUDA, forcing rediscovery of optimizations already encoded in mature libraries. We present FACT (Framework for Agentic CUTLASS Transpilation), a framework that employs a three-stage, agent-driven workflow optimizing PyTorch modules through multi-pattern composition while grounding synthesis in CUTLASS C++. (1) Pattern discovery: an LLM agent inspects the traced graph, matches subgraphs to optimization rules, retrieves vetted examples from an architecture-specific index, and outputs prioritized patterns. (2) Pattern realization: each pattern is implemented as a CUTLASS kernel wrapped in a PyTorch extension, verified, and auto-tuned by sweeping parameters inferred from the CUTLASS hierarchy. (3) Pattern composition: extensions are loaded together into a single composed module for end-to-end benchmarking. We evaluate the workflow using KernelBench's evaluation framework and provided modules on an NVIDIA A100. On Level 1, we apply the workflow to three GEMM workloads (square matrix multiply, batched matrix multiply, and large-$K$ matrix multiply). Auto-tuned CUTLASS kernels improve over PyTorch cuBLAS baseline by $1.06\times$--$1.18\times$. On Level 3 MiniGPT block, composing fused multi-head attention with fused MLP GEMM+GELU yields $2.79\times$ end-to-end speedup. Our work couples agentic graph-level pattern discovery with auto-tuning and a dynamic pattern table, offering a practical path from traced PyTorch to deployable kernels by automating CUTLASS kernel synthesis and auto-tuning.

翻译：暂无翻译

0

相关内容

最新新Agent综述！76页327篇论文梳理，北交大桑基韬教授团队发布《迈向模型原生智能体式人工智能的范式转变综述》

最新新Agent综述！76页327篇论文梳理，北交大桑基韬教授团队发布《迈向模型原生智能体式人工智能的范式转变综述》

专知会员服务

40+阅读 · 2025年10月17日

最新，DeepSeek-R1论文登上Nature封面，附83页补充材料

最新，DeepSeek-R1论文登上Nature封面，附83页补充材料

专知会员服务

27+阅读 · 2025年9月18日

李宏毅老师讲解！《AlphaTensor: 用强化学习找出更有效率的矩阵相乘，附Slides与视频

李宏毅老师讲解！《AlphaTensor: 用强化学习找出更有效率的矩阵相乘，附Slides与视频

专知会员服务

42+阅读 · 2022年10月15日

清华孙茂松等自然·通讯杂志发表生物医学知识计算研究《深度学习系统桥接分子结构和生物医学文本，具有与人类专业相当的理解力》

清华孙茂松等自然·通讯杂志发表生物医学知识计算研究《深度学习系统桥接分子结构和生物医学文本，具有与人类专业相当的理解力》

专知会员服务

22+阅读 · 2022年2月23日

深度学习优化算法，73页ppt，Optimization Algorithms on Deep Learning

深度学习优化算法，73页ppt，Optimization Algorithms on Deep Learning

专知会员服务

135+阅读 · 2021年6月16日

【伯克利博士论文】如何让机器人多技能？通过最大熵强化学习(107页pdf)

【伯克利博士论文】如何让机器人多技能？通过最大熵强化学习(107页pdf)

专知会员服务

78+阅读 · 2019年10月27日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

164+阅读 · 2019年10月12日

开源书：PyTorch深度学习起步

开源书：PyTorch深度学习起步

专知会员服务

51+阅读 · 2019年10月11日

GAN新书《生成式深度学习》，Generative Deep Learning，379页pdf

GAN新书《生成式深度学习》，Generative Deep Learning，379页pdf

专知会员服务

208+阅读 · 2019年9月30日

中文版！学习TensorFlow、PyTorch、机器学习、深度学习四件套！（附免费下载）

中文版！学习TensorFlow、PyTorch、机器学习、深度学习四件套！（附免费下载）

AINLP

29+阅读 · 2020年8月9日

GAN新书《生成式深度学习》Generative Deep Learning，附379页全文PDF

GAN新书《生成式深度学习》Generative Deep Learning，附379页全文PDF

专知

96+阅读 · 2019年9月30日

【Github】nlp-tutorial：TensorFlow 和 PyTorch 实现各种NLP模型

【Github】nlp-tutorial：TensorFlow 和 PyTorch 实现各种NLP模型

AINLP

14+阅读 · 2019年9月4日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

Hands-on Machine Learning with Scikit-Learn and TensorFlow 学习笔记

Hands-on Machine Learning with Scikit-Learn and TensorFlow 学习笔记

AINLP

12+阅读 · 2018年11月12日

资源 | 《Scikit-Learn与TensorFlow》中文精要

资源 | 《Scikit-Learn与TensorFlow》中文精要

AI研习社

25+阅读 · 2018年9月21日

手把手教 | 深度学习库PyTorch（附代码）

手把手教 | 深度学习库PyTorch（附代码）

数据派THU

27+阅读 · 2018年3月15日

【下载】面向Open AI, TensorFlow, Keras的强化学习书籍《Reinforcement Learning》

【下载】面向Open AI, TensorFlow, Keras的强化学习书籍《Reinforcement Learning》

专知

27+阅读 · 2017年12月17日

面向数万处理器的有限元线性方程组与模态多级算法研究

国家自然科学基金

0+阅读 · 2015年12月31日

众核集群上基于MPI的模型扩展及性能优化研究

国家自然科学基金

1+阅读 · 2015年12月31日

神经形态多核处理器的架构模型研究

国家自然科学基金

3+阅读 · 2015年12月31日

大型异构系统上数百万核可扩展的新型区域分裂隐式求解器研究

国家自然科学基金

0+阅读 · 2015年12月31日

三维片上网络芯片关键设计技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

CPU和GPU混合体系结构上生物网络比对并行算法研究

国家自然科学基金

0+阅读 · 2014年12月31日

面向大数据计算的高吞吐量众核处理器关键技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

面向类脑计算存储器的调制编码理论及方法研究

国家自然科学基金

1+阅读 · 2014年12月31日

环氧树脂基交联网络微观结构调控及其热致形状记忆构效关系

国家自然科学基金

0+阅读 · 2014年12月31日

基于支持向量机的复杂连续系统强化学习控制研究

国家自然科学基金

12+阅读 · 2008年12月31日

Agentic Fusion of Large Atomic and Language Models to Accelerate Superconductors Discovery

Arxiv

0+阅读 · 4月29日

Interpretable facial dynamics as behavioral and perceptual traces of deepfakes

Arxiv

0+阅读 · 4月23日

Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

Arxiv

0+阅读 · 4月13日

Synthetic Sandbox for Training Machine Learning Engineering Agents

Arxiv

0+阅读 · 4月6日

O-ConNet: Geometry-Aware End-to-End Inference of Over-Constrained Spatial Mechanisms

Arxiv

0+阅读 · 4月2日

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

Arxiv

0+阅读 · 4月1日

Central limit theorems for the outputs of fully convolutional neural networks with time series input

Arxiv

0+阅读 · 3月31日

Numerical Kernels on a Spatial Accelerator: A Study of Tenstorrent Wormhole

Arxiv

0+阅读 · 3月24日

A Survey on Generative Diffusion Model

Arxiv

46+阅读 · 2022年9月6日

Survey on Evolutionary Deep Learning: Principles, Algorithms, Applications and Open Issues

Arxiv

20+阅读 · 2022年8月23日

VIP会员

文章信息

相关主题

最新内容

DeepSeek 版Claude Code，免费小白安装教程来了！

DeepSeek 版Claude Code，免费小白安装教程来了！

专知会员服务

6+阅读 · 5月5日

【ICML Spotlight 2026】 T²PO: 不确定性引导的探索控制框架，实现稳定多轮Agentic强化学习

【ICML Spotlight 2026】 T²PO: 不确定性引导的探索控制框架，实现稳定多轮Agentic强化学习

专知会员服务

2+阅读 · 5月5日

基础模型驱动的工业智能体：技术成熟度、能力变迁与未竟之挑战

基础模型驱动的工业智能体：技术成熟度、能力变迁与未竟之挑战

专知会员服务

2+阅读 · 5月5日

《机动炮兵的演进与未来：技术进步、历史沿革与炮兵作战前瞻》

《机动炮兵的演进与未来：技术进步、历史沿革与炮兵作战前瞻》

专知会员服务

4+阅读 · 5月5日

《火炮弹药快速效能建模：提升互操作性与技术优势》（报告）

《火炮弹药快速效能建模：提升互操作性与技术优势》（报告）

专知会员服务

4+阅读 · 5月5日

《美空军条令出版物 2-0：情报（2026版）》

《美空军条令出版物 2-0：情报（2026版）》

专知会员服务

11+阅读 · 5月5日

美陆军“飞蝇陷阱5.0”项目将新兴技术交到作战人员手中

美陆军“飞蝇陷阱5.0”项目将新兴技术交到作战人员手中

专知会员服务

3+阅读 · 5月5日

帕兰提尔 Gotham：一个游戏规则改变器

帕兰提尔 Gotham：一个游戏规则改变器

专知会员服务

5+阅读 · 5月5日

【ICML 2026】用测试时训练线性化视觉Transformer：T⁵ 实现 Softmax 注意力到线性复杂度的快速转换

【ICML 2026】用测试时训练线性化视觉Transformer：T⁵ 实现 Softmax 注意力到线性复杂度的快速转换

专知会员服务

2+阅读 · 5月5日

【AAAI 2026】大模型做知识蒸馏：CMM将LLM特征拆解给小模型协同学习

【AAAI 2026】大模型做知识蒸馏：CMM将LLM特征拆解给小模型协同学习

专知会员服务

2+阅读 · 5月5日

【ICML Spotlight 2026 】NonZero：交互引导探索的多智能体蒙特卡洛树搜索

【ICML Spotlight 2026 】NonZero：交互引导探索的多智能体蒙特卡洛树搜索

专知会员服务

8+阅读 · 5月4日

【综述】机器人学习中的世界模型：全面综述

【综述】机器人学习中的世界模型：全面综述

专知会员服务

11+阅读 · 5月4日

伊朗的导弹-无人机行动及其对美国威慑的影响

伊朗的导弹-无人机行动及其对美国威慑的影响

专知会员服务

8+阅读 · 5月4日

《未来战术无人机系统案例研究：量身定制采办策略方法》100页报告

《未来战术无人机系统案例研究：量身定制采办策略方法》100页报告

专知会员服务

8+阅读 · 5月4日

战争贩子：2026年第一季度美国对中东潜在军售激增

战争贩子：2026年第一季度美国对中东潜在军售激增

专知会员服务

6+阅读 · 5月4日

相关VIP内容

最新新Agent综述！76页327篇论文梳理，北交大桑基韬教授团队发布《迈向模型原生智能体式人工智能的范式转变综述》

最新新Agent综述！76页327篇论文梳理，北交大桑基韬教授团队发布《迈向模型原生智能体式人工智能的范式转变综述》

专知会员服务

40+阅读 · 2025年10月17日

最新，DeepSeek-R1论文登上Nature封面，附83页补充材料

最新，DeepSeek-R1论文登上Nature封面，附83页补充材料

专知会员服务

27+阅读 · 2025年9月18日

李宏毅老师讲解！《AlphaTensor: 用强化学习找出更有效率的矩阵相乘，附Slides与视频

李宏毅老师讲解！《AlphaTensor: 用强化学习找出更有效率的矩阵相乘，附Slides与视频

专知会员服务

42+阅读 · 2022年10月15日

清华孙茂松等自然·通讯杂志发表生物医学知识计算研究《深度学习系统桥接分子结构和生物医学文本，具有与人类专业相当的理解力》

清华孙茂松等自然·通讯杂志发表生物医学知识计算研究《深度学习系统桥接分子结构和生物医学文本，具有与人类专业相当的理解力》

专知会员服务

22+阅读 · 2022年2月23日

深度学习优化算法，73页ppt，Optimization Algorithms on Deep Learning

深度学习优化算法，73页ppt，Optimization Algorithms on Deep Learning

专知会员服务

135+阅读 · 2021年6月16日

【伯克利博士论文】如何让机器人多技能？通过最大熵强化学习(107页pdf)

【伯克利博士论文】如何让机器人多技能？通过最大熵强化学习(107页pdf)

专知会员服务

78+阅读 · 2019年10月27日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

164+阅读 · 2019年10月12日

开源书：PyTorch深度学习起步

开源书：PyTorch深度学习起步

专知会员服务

51+阅读 · 2019年10月11日

GAN新书《生成式深度学习》，Generative Deep Learning，379页pdf

GAN新书《生成式深度学习》，Generative Deep Learning，379页pdf

专知会员服务

208+阅读 · 2019年9月30日

热门VIP内容

开通专知VIP会员享更多权益服务

【ICML Spotlight 2026】 T²PO: 不确定性引导的探索控制框架，实现稳定多轮Agentic强化学习

《机动炮兵的演进与未来：技术进步、历史沿革与炮兵作战前瞻》

DeepSeek 版Claude Code，免费小白安装教程来了！

基础模型驱动的工业智能体：技术成熟度、能力变迁与未竟之挑战

相关资讯

中文版！学习TensorFlow、PyTorch、机器学习、深度学习四件套！（附免费下载）

中文版！学习TensorFlow、PyTorch、机器学习、深度学习四件套！（附免费下载）

AINLP

29+阅读 · 2020年8月9日

GAN新书《生成式深度学习》Generative Deep Learning，附379页全文PDF

GAN新书《生成式深度学习》Generative Deep Learning，附379页全文PDF

专知

96+阅读 · 2019年9月30日

【Github】nlp-tutorial：TensorFlow 和 PyTorch 实现各种NLP模型

【Github】nlp-tutorial：TensorFlow 和 PyTorch 实现各种NLP模型

AINLP

14+阅读 · 2019年9月4日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

Hands-on Machine Learning with Scikit-Learn and TensorFlow 学习笔记

Hands-on Machine Learning with Scikit-Learn and TensorFlow 学习笔记

AINLP

12+阅读 · 2018年11月12日

资源 | 《Scikit-Learn与TensorFlow》中文精要

资源 | 《Scikit-Learn与TensorFlow》中文精要

AI研习社

25+阅读 · 2018年9月21日

手把手教 | 深度学习库PyTorch（附代码）

手把手教 | 深度学习库PyTorch（附代码）

数据派THU

27+阅读 · 2018年3月15日

【下载】面向Open AI, TensorFlow, Keras的强化学习书籍《Reinforcement Learning》

【下载】面向Open AI, TensorFlow, Keras的强化学习书籍《Reinforcement Learning》

专知

27+阅读 · 2017年12月17日

相关论文

Agentic Fusion of Large Atomic and Language Models to Accelerate Superconductors Discovery

Arxiv

0+阅读 · 4月29日

Interpretable facial dynamics as behavioral and perceptual traces of deepfakes

Arxiv

0+阅读 · 4月23日

Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

Arxiv

0+阅读 · 4月13日

Synthetic Sandbox for Training Machine Learning Engineering Agents

Arxiv

0+阅读 · 4月6日

O-ConNet: Geometry-Aware End-to-End Inference of Over-Constrained Spatial Mechanisms

Arxiv

0+阅读 · 4月2日

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

Arxiv

0+阅读 · 4月1日

Central limit theorems for the outputs of fully convolutional neural networks with time series input

Arxiv

0+阅读 · 3月31日

Numerical Kernels on a Spatial Accelerator: A Study of Tenstorrent Wormhole

Arxiv

0+阅读 · 3月24日

A Survey on Generative Diffusion Model

Arxiv

46+阅读 · 2022年9月6日

Survey on Evolutionary Deep Learning: Principles, Algorithms, Applications and Open Issues

Arxiv

20+阅读 · 2022年8月23日

相关基金

面向数万处理器的有限元线性方程组与模态多级算法研究

国家自然科学基金

0+阅读 · 2015年12月31日

众核集群上基于MPI的模型扩展及性能优化研究

国家自然科学基金

1+阅读 · 2015年12月31日

神经形态多核处理器的架构模型研究

国家自然科学基金

3+阅读 · 2015年12月31日

大型异构系统上数百万核可扩展的新型区域分裂隐式求解器研究

国家自然科学基金

0+阅读 · 2015年12月31日

三维片上网络芯片关键设计技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

CPU和GPU混合体系结构上生物网络比对并行算法研究

国家自然科学基金

0+阅读 · 2014年12月31日

面向大数据计算的高吞吐量众核处理器关键技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

面向类脑计算存储器的调制编码理论及方法研究

国家自然科学基金

1+阅读 · 2014年12月31日

环氧树脂基交联网络微观结构调控及其热致形状记忆构效关系

国家自然科学基金

0+阅读 · 2014年12月31日

基于支持向量机的复杂连续系统强化学习控制研究

国家自然科学基金

12+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员