Large Language Model (LLM) inference presents a unique scheduling challenge due to the Key-Value (KV) cache, where a job's memory footprint grows linearly with the number of decoded tokens. This growth couples scheduling decisions with feasibility: a scheduler must minimize latency under a hard memory budget, yet the response lengths of requests are inherently unknown. While recent works have explored this problem either by assuming clairvoyance -- exact knowledge of response lengths -- or by relying on machine-learned predictions, obtaining robust performance guarantees without any prior knowledge of job sizes remains a theoretically fundamental and practically important open problem. In this work, we propose the Geometric Slicing Algorithm (GSA), the first non-clairvoyant policy to achieve a constant competitive ratio for this problem in the offline batch setting. GSA manages uncertainty through a geometric phase structure that periodically restarts jobs to bound memory exposure, combined with a staggered pipeline mechanism that enables high concurrency by smoothing aggregate memory consumption. We prove that GSA achieves a competitive ratio of at most 61.92 for general instances, improving to 32 in the large-memory regime. Our algorithmic framework also yields a clairvoyant counterpart, the Geometric Batching Algorithm (GBA), which achieves an approximation ratio of 10.67 for general instances and 6.75 in the large-memory regime -- significantly improving upon the best previously known bound of over 9000. Numerical experiments on real request traces demonstrate that our algorithms perform robustly while preserving these worst-case guarantees.
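The geometric phase structure can be illustrated with a minimal sketch. The function below is purely illustrative (its name and parameters are not from the paper): a job of unknown response length is granted geometrically growing token budgets and is restarted at each phase boundary, so the tokens it holds in memory at any moment are bounded by the current budget, while the total work wasted on restarts stays within a constant factor of the job's true length.

```python
def geometric_phases(true_length, base=1, ratio=2):
    """Illustrative sketch of geometric slicing for one job.

    A job whose response length is unknown is run in phases with
    budgets base, base*ratio, base*ratio^2, ...; it is restarted at
    each phase boundary and completes in the first phase whose
    budget covers its true length. Returns the list of budgets used.
    """
    budgets = []
    budget = base
    while True:
        budgets.append(budget)
        if budget >= true_length:
            return budgets  # job finishes within this phase
        budget *= ratio  # restart with a geometrically larger budget


# A job of length 5 runs under budgets 1, 2, 4, then finishes under 8.
phases = geometric_phases(5)
# Total tokens decoded across all attempts is sum(phases) = 15, i.e. at
# most a constant factor (ratio/(ratio-1) times the final budget) more
# than the clairvoyant cost of decoding 5 tokens once.
```

This captures only the per-job restart idea; the memory-smoothing staggered pipeline and the competitive analysis are the substance of the paper itself.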