Large language model (LLM) inference has emerged as a fundamental paradigm; however, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing systems, such as prefill-decode (PD) disaggregation architectures, rely on static prefill-to-decode scheduling, which often results in service-level objective (SLO) violations and out-of-memory (OOM) failures as decode workloads evolve. In this paper, we propose STAR, a decode rescheduling system that uses length prediction to anticipate future workloads. Our core contributions include: (1) a lightweight, continuous, LLM-native prediction method that leverages the LLM's hidden states to model the remaining generation length with high precision (reducing MAE by 49.42%) and low overhead (cutting predictor parameters by 93.28%); (2) a decode-phase rescheduling solution with a dynamic balancing mechanism that integrates current and predicted workloads, reducing P99 time per output token (TPOT) by 75.1% and achieving 2.63 times higher goodput.
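To make the idea of balancing on "current plus predicted" workload concrete, the following is a minimal sketch, not STAR's actual algorithm. It assumes each decode request carries its current token count and a predicted remaining length from some predictor. All names (`Request`, `DecodeInstance`, `rebalance_once`) and the weighting factor `alpha` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: str
    current_tokens: int       # tokens already held in the KV cache
    predicted_remaining: int  # output of a (hypothetical) length predictor


@dataclass
class DecodeInstance:
    name: str
    requests: list = field(default_factory=list)


def request_load(req, alpha=0.5):
    # Assumed load model: current tokens plus a discounted share of
    # the predicted future tokens. alpha is an illustrative knob.
    return req.current_tokens + alpha * req.predicted_remaining


def instance_load(inst, alpha=0.5):
    return sum(request_load(r, alpha) for r in inst.requests)


def rebalance_once(instances, alpha=0.5):
    """Move at most one request from the most loaded instance to the
    least loaded one. Returns (req_id, src_name, dst_name) or None."""
    src = max(instances, key=lambda i: instance_load(i, alpha))
    dst = min(instances, key=lambda i: instance_load(i, alpha))
    gap = instance_load(src, alpha) - instance_load(dst, alpha)
    if src is dst or gap <= 0:
        return None
    # Moving a request of load w changes the gap to |gap - 2w|, so only
    # requests with w < gap shrink the imbalance.
    candidates = [r for r in src.requests if request_load(r, alpha) < gap]
    if not candidates:
        return None
    best = min(candidates, key=lambda r: abs(gap - 2 * request_load(r, alpha)))
    src.requests.remove(best)
    dst.requests.append(best)
    return best.req_id, src.name, dst.name


a = DecodeInstance("a", [Request("r1", 100, 2000), Request("r2", 200, 100)])
b = DecodeInstance("b", [Request("r3", 300, 200)])
print(rebalance_once([a, b]))  # ('r2', 'a', 'b')
```

In this toy setup, instance `a` looks only slightly busier than `b` by current token count (300 vs. 300), but the predicted remaining length of `r1` makes it much heavier in the near future, so the sketch migrates `r2` to `b`. A real system would also need to account for KV-cache transfer cost and memory limits, which this sketch ignores.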