The emergence of reasoning-based LLMs leveraging Chain-of-Thought (CoT) inference introduces new serving challenges, as their extended reasoning phases delay user-visible output and inflate Time-To-First-Token (TTFT). Existing LLM serving frameworks fail to distinguish between reasoning and answering phases, leading to performance degradation under GPU memory constraints. We present PASCAL, a phase-aware scheduling algorithm that prioritizes reasoning to reduce TTFT while using controlled preemption and token pacing during answering to preserve Quality-of-Experience (QoE). Our hierarchical scheduler combines instance-level placement with intra-instance execution and enables dynamic migration at phase boundaries to balance load and reduce interference. Across benchmarks using DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering phase SLO attainment, demonstrating the importance of phase-aware scheduling for reasoning-based LLM deployment.
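The core scheduling idea above can be illustrated with a minimal sketch. This is not PASCAL's actual implementation; it is a toy two-level priority queue, assuming only that reasoning-phase requests are dispatched ahead of answering-phase ones (to cut TTFT) while requests within a phase keep FIFO order. All names (`PhaseAwareScheduler`, `submit`, `next_request`) are hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Illustrative phase constants: lower value = higher scheduling priority.
REASONING, ANSWERING = 0, 1

@dataclass(order=True)
class Request:
    priority: int                      # phase-derived priority (compared first)
    seq: int                           # arrival order, keeps FIFO within a phase
    name: str = field(compare=False)
    phase: int = field(compare=False)

class PhaseAwareScheduler:
    """Toy phase-aware scheduler: reasoning-phase requests jump the queue.

    A hypothetical sketch of the abstract's idea, not PASCAL itself:
    preemption, token pacing, and migration are omitted.
    """
    def __init__(self):
        self._heap = []
        self._counter = count()

    def submit(self, name, phase):
        # Reasoning gets priority 0, answering priority 1.
        heapq.heappush(self._heap, Request(phase, next(self._counter), name, phase))

    def next_request(self):
        return heapq.heappop(self._heap).name if self._heap else None

sched = PhaseAwareScheduler()
sched.submit("ans-1", ANSWERING)
sched.submit("reason-1", REASONING)
sched.submit("ans-2", ANSWERING)
# Reasoning-phase work is dispatched first, then answering in FIFO order.
order = [sched.next_request() for _ in range(3)]
print(order)  # ['reason-1', 'ans-1', 'ans-2']
```

A real serving scheduler would additionally bound how long answering-phase requests can be starved (the abstract's controlled preemption and token pacing) so that answering-phase SLOs still hold.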