Computation and communication in distributed LLM training and inference are traditionally optimized in isolation. Expert-crafted systems such as DeepEP, FLUX, and TokenWeave show the potential of co-design, but they require deep systems expertise and hardware-specific tuning. CUCo is an agentic framework that automates compute-communication co-design of CUDA kernels. It combines a structured formalization of the design space with two agents: a correctness-first fast-path agent that produces reliable baselines, and an evolution-driven slow-path agent that searches for high-performance strategies. CUCo achieves up to 1.57x speedup across four multi-GPU workloads and, on a DeepSeek-V3 MoE layer, discovers a two-stream overlap strategy that hides dispatch behind local compute, at an LLM inference cost under $10 per workload.
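To make the idea of hiding dispatch behind local compute concrete, here is a minimal back-of-the-envelope timing model. It is an illustrative sketch, not CUCo's actual kernel or cost model: the three-phase breakdown (token dispatch, compute on local tokens, compute on received tokens), the function names, and the example timings are all assumptions introduced here, and the overlap is idealized (no contention between the two streams).

```python
def serial_time_us(dispatch_us, local_us, remote_us):
    """Single-stream schedule: dispatch, then local compute, then compute on
    received tokens, each phase waiting for the previous one."""
    return dispatch_us + local_us + remote_us


def two_stream_time_us(dispatch_us, local_us, remote_us):
    """Idealized two-stream schedule (assumption): dispatch runs on a
    communication stream while local compute runs on a compute stream.
    Compute on received tokens starts once both have finished."""
    return max(dispatch_us, local_us) + remote_us


def overlap_speedup(dispatch_us, local_us, remote_us):
    """Ratio of serial time to overlapped time under the model above."""
    serial = serial_time_us(dispatch_us, local_us, remote_us)
    overlapped = two_stream_time_us(dispatch_us, local_us, remote_us)
    return serial / overlapped


if __name__ == "__main__":
    # Hypothetical phase timings in microseconds, not measurements from the paper.
    d, l, r = 300, 400, 500
    print("serial:", serial_time_us(d, l, r))
    print("two-stream:", two_stream_time_us(d, l, r))
    print("speedup:", round(overlap_speedup(d, l, r), 3))
```

Under this model, dispatch is fully hidden only when it takes no longer than local compute. Otherwise the excess dispatch time stays on the critical path, which is one reason a real schedule needs workload-specific tuning.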