Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly leverage computation and communication remains labor-intensive and error-prone. Prior work on kernel optimization has focused almost exclusively on computation, leaving communication kernels largely untouched even though they constitute a significant share of total execution time. We introduce CUCo, a training-free agent-driven workflow that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks optimization opportunities unavailable to existing approaches, outperforming state-of-the-art baselines and achieving end-to-end speedups of up to $1.57\times$.