Modern GPU workloads, especially large language model (LLM) inference, suffer from kernel launch overheads and coarse-grained synchronization that limit inter-kernel parallelism. Recent megakernel techniques fuse multiple operators into a single persistent kernel to eliminate launch gaps and expose inter-kernel parallelism, but they struggle to handle the dynamic shapes and data-dependent computation found in real workloads. We present Event Tensor, a unified compiler abstraction for dynamic megakernels. Event Tensor encodes dependencies between tiled tasks and enables first-class support for both shape dynamism and data-dependent dynamism. Built atop this abstraction, our Event Tensor Compiler (ETC) applies static and dynamic scheduling transformations to generate high-performance persistent kernels. Evaluations show that ETC achieves state-of-the-art LLM serving latency while significantly reducing system warmup overhead.