Despite significant advances in CUDA programming and domain-specific libraries, effectively utilizing GPUs' massively parallel engines remains difficult. Large language models (LLMs) show strong potential for generating optimized CUDA code from sequential code. However, using LLMs in practice faces two major challenges: cloud-based APIs pose risks of code leakage, while local deployment is often computationally expensive and inefficient. These drawbacks have spurred interest in small language models (SLMs), which are more lightweight and privacy-friendly. Encouragingly, recent studies show that SLMs can match LLMs on domain-specific tasks; however, our experiments show that their limited reasoning ability leads to suboptimal performance on complex CUDA generation. To bridge this gap, we propose ReGraphT, a training-free, retrieval-augmented generation framework that transfers LLM-level reasoning to smaller models. ReGraphT organizes CUDA optimization trajectories into a structured reasoning graph, modeling combined CUDA optimizations as state transitions, and leverages Monte Carlo Graph Search (MCGS) for efficient exploration. We also present a CUDA-specific benchmark with difficulty tiers defined by reasoning complexity, enabling more comprehensive evaluation. Experiments show that ReGraphT outperforms HPC-specific fine-tuned models and other retrieval-augmented approaches, achieving an average 2.33X speedup on CUDAEval and ParEval. When paired with DeepSeek-Coder-V2-Lite-Instruct and Qwen2.5-Coder-7B-Instruct, ReGraphT enables SLMs to approach LLM-level performance without the associated privacy risks or excessive computing overhead.
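To make the "optimizations as state transitions explored by MCGS" idea concrete, the following is a minimal toy sketch, not the paper's implementation: states are sets of applied optimizations, edges apply one more optimization, identical states reached in different orders share a single graph node, and a UCT rule balances exploration against exploitation. The optimization names, the `toy_reward` function, and all numeric weights are illustrative assumptions; in the real system the reward would come from measuring generated CUDA code.

```python
import math

# Hypothetical optimization "moves" (illustrative, not from the paper).
OPTIMIZATIONS = ["coalesce_memory", "shared_memory_tiling", "loop_unroll", "use_streams"]

class Node:
    def __init__(self, state):
        self.state = state    # frozenset of applied optimizations
        self.visits = 0
        self.value = 0.0      # accumulated reward (stand-in for measured speedup)

def successors(state):
    """State transitions: apply one not-yet-applied optimization."""
    return [state | {opt} for opt in OPTIMIZATIONS if opt not in state]

def uct(parent, child, c=1.4):
    """Upper-confidence score; unvisited children are explored first."""
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcgs(reward_fn, iterations=200):
    # A graph, not a tree: states reached via different orders share one node.
    nodes = {frozenset(): Node(frozenset())}
    root = nodes[frozenset()]
    for _ in range(iterations):
        # Selection: descend by UCT until a terminal or newly-expanded node.
        path = [root]
        while True:
            succ = successors(path[-1].state)
            if not succ:
                break
            children = [nodes.setdefault(s, Node(s)) for s in succ]
            best = max(children, key=lambda ch: uct(path[-1], ch))
            path.append(best)
            if best.visits == 0:
                break
        # Simulation: score the reached state (stand-in for running the kernel).
        reward = reward_fn(path[-1].state)
        # Backpropagation along the selection path.
        for node in path:
            node.visits += 1
            node.value += reward
    best = max(nodes.values(), key=lambda n: n.value / n.visits if n.visits else 0.0)
    return best.state

# Toy reward: pretend each optimization contributes a fixed multiplicative speedup.
WEIGHTS = {"coalesce_memory": 1.5, "shared_memory_tiling": 1.3,
           "loop_unroll": 1.1, "use_streams": 1.2}

def toy_reward(state):
    r = 1.0
    for opt in state:
        r *= WEIGHTS[opt]
    return r

best_state = mcgs(toy_reward)
```

Because identical optimization sets collapse to one node, visit statistics from different application orders are pooled, which is what distinguishes graph search from plain tree-based MCTS in this setting.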