The long-tailed distribution of sequence lengths in LLM serving and reinforcement learning (RL) sampling causes significant computational waste due to excessive padding in batched inference. Existing methods rely on auxiliary models for static length prediction, but these incur high overhead, generalize poorly, and fail in stochastic "one-to-many" sampling scenarios. We introduce a lightweight framework that reuses the main model's internal hidden states for efficient length prediction. Our framework features two core components: 1) Entropy-Guided Token Pooling (EGTP), which combines on-the-fly activations with token-level entropy to deliver highly accurate static prediction at negligible cost, and 2) Progressive Length Prediction (PLP), which dynamically estimates the remaining length at each decoding step to handle stochastic generation. To validate our approach, we build and release ForeLen, a comprehensive benchmark spanning long-sequence, Chain-of-Thought, and RL data. On ForeLen, EGTP achieves state-of-the-art accuracy, reducing mean absolute error (MAE) by 29.16\% relative to the best baseline. Integrating our methods with a length-aware scheduler yields significant end-to-end throughput gains. Our work provides a new technical and evaluation baseline for efficient LLM inference.
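To make the EGTP idea concrete, the following is a minimal NumPy sketch of entropy-weighted pooling over a model's last-layer hidden states, followed by a small regression head that maps the pooled vector to a length estimate. The per-token entropy weighting, the normalization scheme, the linear head, and all shapes here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def token_entropy(logits):
    # Shannon entropy of each token's next-token distribution
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)      # shape: (T,)

def entropy_guided_pool(hidden, logits):
    # Pool hidden states, weighting each token by its normalized entropy
    # (higher-uncertainty tokens contribute more to the summary vector).
    ent = token_entropy(logits)
    w = ent / ent.sum()
    return (w[:, None] * hidden).sum(axis=0)          # shape: (D,)

# Toy example with random activations (a real system would reuse the
# serving model's own hidden states and logits at no extra forward cost).
rng = np.random.default_rng(0)
T, D, V = 8, 16, 32                                   # tokens, hidden dim, vocab
hidden = rng.standard_normal((T, D))
logits = rng.standard_normal((T, V))

pooled = entropy_guided_pool(hidden, logits)

# Hypothetical linear regression head predicting the output length.
w_head, b_head = rng.standard_normal(D), 0.0
pred_len = max(0.0, float(w_head @ pooled + b_head))
```

In this sketch the pooled vector is a single cheap matrix-vector product away from a length estimate, which is what makes hidden-state reuse attractive compared with running a separate predictor model.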