Modern LLM service providers increasingly rely on autoscaling and parallelism reconfiguration to respond to rapidly changing workloads, but cold-start latency remains a major bottleneck. While recent systems have reduced model weight loading to seconds, CUDA graph capture still takes tens of seconds to minutes and often dominates startup. Unfortunately, CUDA graphs cannot be naively serialized: beyond graph topology, they are tightly coupled to their execution context, including device addresses embedded in kernel arguments and kernel code that is lazily loaded during warmup. Existing approaches rely on either brittle kernel-specific patching or heavyweight process-level checkpoint/restore, which is inflexible under dynamic parallelism switching. We present Foundry, a template-based CUDA graph context materialization system that persists both graph topology and execution context during an offline processing stage and reconstructs executable graphs online with negligible overhead. Foundry enforces deterministic memory layouts, automatically extracts and reloads the kernel binaries required by captured graphs, and reduces online reconstruction costs through topology-based templating. For distributed serving, Foundry further enables a single-GPU offline capture to generate templates for multi-GPU deployments by patching only rank-dependent communication state. Across dense and MoE models of up to 235B parameters, Foundry reduces cold-start latency by up to 99%, cutting the initialization time of Qwen3-235B-A22B from 10 minutes to 3.9 seconds while preserving the throughput gains of CUDA graphs.
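The idea behind topology-based templating can be illustrated with a small, purely conceptual Python sketch. It is not Foundry's implementation and uses no CUDA API: the `Node` and `Graph` types, the kernel names, and the addresses are all hypothetical. The assumption is only that graphs captured for different batch sizes often share the same kernel sequence and dependencies and differ in their arguments, so the topology can be stored once and each graph reduced to a list of arguments.

```python
from dataclasses import dataclass
import hashlib
import json


@dataclass(frozen=True)
class Node:
    kernel: str   # hypothetical kernel name
    args: tuple   # kernel arguments, e.g. device addresses and sizes


@dataclass(frozen=True)
class Graph:
    nodes: tuple  # Node objects in capture order
    edges: tuple  # (src_index, dst_index) dependency pairs


def topology_key(graph):
    """Hash kernel names and edges only, ignoring per-graph arguments."""
    shape = {
        "kernels": [n.kernel for n in graph.nodes],
        "edges": sorted(graph.edges),
    }
    return hashlib.sha256(json.dumps(shape).encode()).hexdigest()


def build_templates(graphs):
    """Store each distinct topology once, plus the arguments of every graph."""
    templates = {}  # topology key -> (kernel names, edges)
    instances = {}  # graph name -> (topology key, per-node arguments)
    for name, g in graphs.items():
        key = topology_key(g)
        templates.setdefault(key, (tuple(n.kernel for n in g.nodes), g.edges))
        instances[name] = (key, tuple(n.args for n in g.nodes))
    return templates, instances


def reconstruct(name, templates, instances):
    """Rebuild a graph by filling its shared template with its own arguments."""
    key, args = instances[name]
    kernels, edges = templates[key]
    nodes = tuple(Node(k, a) for k, a in zip(kernels, args))
    return Graph(nodes, edges)


# Two hypothetical graphs (batch sizes 1 and 2) with identical topology.
graphs = {
    "bs1": Graph((Node("gemm", (0x1000, 1)), Node("softmax", (0x2000, 1))), ((0, 1),)),
    "bs2": Graph((Node("gemm", (0x1000, 2)), Node("softmax", (0x2000, 2))), ((0, 1),)),
}
templates, instances = build_templates(graphs)
rebuilt = reconstruct("bs2", templates, instances)
```

In this toy setting, both graphs map to a single template, and rebuilding a graph only requires substituting its stored arguments. A real system would additionally have to guarantee that the stored device addresses remain valid at load time, which is the role the abstract assigns to deterministic memory layouts.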