Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale

When output token counts can be predicted at submission time (Gan et al., 2026), client-side scheduling against a black-box LLM API becomes semi-clairvoyant: decisions condition on coarse token priors even though the provider's internals remain hidden. We decompose this boundary problem into three separable concerns: allocation (inter-class share via adaptive deficit round robin, DRR), ordering (intra-class sequencing with feasible-set scoring), and overload control (explicit admit/defer/reject decisions on a cost ladder). An information-ladder experiment shows that coarse magnitude priors, not class labels alone, are the practical threshold for useful client control: removing magnitude information inflates short-request P95 by up to $5.8\times$ and degrades deadline satisfaction. Under balanced / high-congestion conditions, the full stack achieves 100% completion, 100% deadline satisfaction, and useful goodput of $4.2 \pm 1.6$ SLO-meeting requests/s, with short-request P95 within tens of milliseconds of that achieved by quota-tiered isolation. A predictor-noise sweep confirms graceful degradation under up to 60% multiplicative error. Heavy-dominated regimes separate the policies on completion rate, tail latency, and the interpretability of shedding. We further compare Short-Priority allocation (biased toward interactive traffic) with Fair Queuing (round-robin across classes): Fair Queuing improves short-request P90 by 32% over FIFO at a cost of only 17% long-request overhead, whereas Short-Priority yields a 27% improvement at 116% overhead. This demonstrates that the allocation layer accommodates different fairness objectives without changing the rest of the stack. We contribute the three-layer client-side decomposition, a controlled evaluation of joint metrics across regimes, a comparison of allocation-policy alternatives, and overload-policy evidence linking cost-ladder shedding to the stated service objective.
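
To make the allocation layer concrete, the following is a minimal sketch of plain deficit round robin over request classes, where each request's cost is its predicted output-token count. It is not the adaptive variant evaluated in the paper: the quanta are fixed, and the names `Request`, `ClassQueue`, `est_tokens`, and `drr_schedule` are hypothetical, introduced only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: str
    est_tokens: int  # coarse output-token prior (hypothetical field name)


@dataclass
class ClassQueue:
    quantum: int  # token credit granted to this class each round (must be > 0)
    deficit: int = 0
    pending: deque = field(default_factory=deque)


def drr_schedule(classes: dict[str, ClassQueue]) -> list[str]:
    """Drain every class queue with fixed-quantum DRR; return the dispatch order."""
    order: list[str] = []
    while any(q.pending for q in classes.values()):
        for q in classes.values():
            if not q.pending:
                q.deficit = 0  # an idle class does not bank credit
                continue
            q.deficit += q.quantum
            # Dispatch head-of-line requests while the accumulated credit covers them.
            while q.pending and q.pending[0].est_tokens <= q.deficit:
                req = q.pending.popleft()
                q.deficit -= req.est_tokens
                order.append(req.req_id)
            if not q.pending:
                q.deficit = 0
    return order


if __name__ == "__main__":
    classes = {
        "short": ClassQueue(quantum=200, pending=deque(
            [Request("s1", 100), Request("s2", 100), Request("s3", 100)])),
        "long": ClassQueue(quantum=200, pending=deque(
            [Request("l1", 500), Request("l2", 500)])),
    }
    print(drr_schedule(classes))  # ['s1', 's2', 's3', 'l1', 'l2']
```

With equal quanta, the short class clears its queue before the first long request has accumulated enough credit to run, yet the long class is never starved: its deficit grows every round until the head request fits. An adaptive scheme would presumably adjust `quantum` per class at runtime; how the paper does so is not specified in the abstract.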
