Deploying large language models (LLMs) for online inference is often constrained by limited GPU memory, particularly by the KV cache, which grows during auto-regressive decoding. Hybrid GPU-CPU execution has emerged as a promising solution: it offloads KV cache management and parts of the attention computation to the CPU. However, a key bottleneck remains: existing schedulers fail to effectively overlap CPU-offloaded tasks with GPU execution during the latency-critical, bandwidth-bound decode phase. This particularly penalizes real-time, decode-heavy applications (e.g., chat, chain-of-thought reasoning), which are underserved by existing systems, especially under the memory pressure typical of edge or low-cost deployments. We present APEX, a novel, profiling-informed scheduling strategy that maximizes CPU-GPU parallelism during hybrid LLM inference. Unlike systems that rely on static rules or purely heuristic policies, APEX dynamically dispatches compute across heterogeneous resources, predicting the execution times of CPU and GPU subtasks to maximize overlap while avoiding scheduling overhead. We evaluate APEX on diverse workloads and GPU architectures (NVIDIA T4, A10) using the LLaMa-2-7B and LLaMa-3.1-8B models. Compared with GPU-only schedulers such as vLLM, APEX improves throughput by 84%–96% on the T4 and by 11%–89% on the A10 while preserving latency. Against the best existing hybrid schedulers, it delivers up to 72% (T4) and 37% (A10) higher throughput in long-output settings. APEX significantly advances hybrid LLM inference efficiency on memory-constrained hardware and provides a blueprint for scheduling in heterogeneous AI systems, filling a critical gap for efficient real-time LLM applications.
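To make the core scheduling idea concrete, the following is a minimal, hypothetical sketch of profiling-informed dispatch: given predicted per-subtask execution times on each device, a greedy policy assigns every subtask to whichever resource would finish it sooner, so CPU and GPU work proceeds in parallel and their completion times stay balanced. All names (`schedule`, `predict_cpu`, `predict_gpu`) are illustrative assumptions, not APEX's actual implementation.

```python
def schedule(tasks, predict_cpu, predict_gpu):
    """Greedily dispatch subtasks across CPU and GPU to maximize overlap.

    predict_cpu / predict_gpu map a task to its predicted execution time
    (e.g., from offline profiling). Each task goes to the resource on
    which it would finish earlier given that resource's current load.
    """
    cpu_busy = gpu_busy = 0.0
    plan = []
    # Longest-predicted-first ordering tightens the greedy makespan.
    for t in sorted(tasks, key=predict_gpu, reverse=True):
        if cpu_busy + predict_cpu(t) < gpu_busy + predict_gpu(t):
            cpu_busy += predict_cpu(t)
            plan.append((t, "cpu"))
        else:
            gpu_busy += predict_gpu(t)
            plan.append((t, "gpu"))
    # With both devices running concurrently, the step takes as long as
    # the busier one, rather than the sum of all subtask times.
    return plan, max(cpu_busy, gpu_busy)
```

Even this toy policy shows why predicted times matter: if CPU subtasks were offloaded blindly, a slow CPU-side attention fragment could extend the decode step past the GPU-only baseline instead of hiding behind it.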