Modern GPUs adopt chiplet-based designs in which each chiplet has its own private cache hierarchy, but current programming models (CUDA/HIP) expose a flat execution hierarchy that cannot express chiplet-level locality or synchronization. This mismatch leads to redundant memory traffic and poor cache utilization in memory-bound workloads such as LLM inference. We present Fleet, a multi-level task model that maps computation to memory scopes. Fleet introduces Chiplet-tasks, a new abstraction that binds work and data to a chiplet and enables coordination through that chiplet's shared L2 cache. Wavefront-level, CU-level, and device-level tasks align with existing abstractions, while Chiplet-tasks expose a previously unaddressed level of the hierarchy. Fleet is implemented as a persistent kernel runtime with per-chiplet scheduling, so that workers within a chiplet execute tasks cooperatively and coordinate their cache reuse. On AMD Instinct MI350 with Qwen3-8B, Fleet achieves 1.3-1.5x lower decode latency than vLLM at batch sizes 1-8 through persistent kernel execution and per-chiplet scheduling. At larger batch sizes, cooperative weight tiling increases the L2 hit rate (from 12% to 54% at batch size 32 and from 39% to 61% at batch size 64), reducing HBM traffic by up to 37% and delivering a 1.27-1.30x speedup over a chiplet-unaware megakernel baseline.
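The intuition behind cooperative weight tiling can be illustrated with a toy model. The sketch below is not Fleet's scheduler; it is a minimal Python illustration under stated assumptions: each chiplet has an L2 large enough to hold every weight tile it touches, a tile costs one HBM fetch per chiplet that touches it, and each weight tile is consumed by several tasks. The names `flat_schedule`, `chiplet_schedule`, and `hbm_tile_fetches`, as well as the tile, consumer, and chiplet counts, are hypothetical.

```python
from itertools import product


def hbm_tile_fetches(assignment):
    """Toy HBM cost: one fetch per distinct (chiplet, weight tile) pair.

    Assumes an idealized per-chiplet L2 that never evicts a tile once loaded.
    """
    return len({(chiplet, tile) for chiplet, tile, _ in assignment})


def flat_schedule(num_tiles, consumers_per_tile, num_chiplets):
    """Chiplet-unaware baseline: tasks are dealt round-robin by task index."""
    tasks = product(range(num_tiles), range(consumers_per_tile))
    return [(i % num_chiplets, tile, c) for i, (tile, c) in enumerate(tasks)]


def chiplet_schedule(num_tiles, consumers_per_tile, num_chiplets):
    """Chiplet-aware variant: every task that reads a tile runs on one chiplet."""
    tasks = product(range(num_tiles), range(consumers_per_tile))
    return [(tile % num_chiplets, tile, c) for tile, c in tasks]


def l2_hit_rate(assignment):
    """Fraction of tile accesses served from L2 in this toy model."""
    return 1.0 - hbm_tile_fetches(assignment) / len(assignment)
```

With 64 tiles, 8 consumers per tile, and 8 chiplets, the flat schedule spreads each tile's consumers over all 8 chiplets and incurs 512 tile fetches, whereas the chiplet-aware schedule incurs 64. The model ignores L2 capacity, eviction, and load balance, so it shows only the direction of the effect, not the magnitudes reported above.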