Two widely adopted techniques in LLM inference serving systems today are hybrid batching and disaggregated serving. A hybrid batch combines prefill and decode tokens from different requests in the same batch to improve resource utilization and throughput, at the cost of increased per-token latency. In contrast, disaggregated serving decouples the compute-bound prefill and bandwidth-bound decode phases to optimize for service level objectives (SLOs), at the cost of resource under-utilization and KV-cache transfer overheads. To address the limitations of these techniques, we propose RAPID-Serve: a technique that concurrently executes prefill and decode on the same GPU(s) to meet latency SLOs while maintaining high throughput and efficient resource utilization. Furthermore, we propose Adaptive Resource Management for runtime compute resource allocation, optionally leveraging CU masking (a fine-grained Compute Unit partitioning feature on AMD Instinct\textsuperscript{TM} GPUs). RAPID-Serve delivers up to 4.1x (average 1.7x) unconstrained throughput improvement, and 32x and higher (average 4.9x) throughput improvement under SLO constraints, demonstrating that it is an effective strategy compared to state-of-the-art approaches, particularly in resource-constrained environments.