Token Management in Multi-Tenant AI Inference Platforms

Multi-tenant AI inference platforms must balance resource utilization against service-level guarantees under variable demand. Conventional approaches fail to achieve this balance: dedicated endpoints strand capacity on idle models, while rate limits ignore the heterogeneous cost of inference requests. We introduce \emph{token pools}, a control-plane abstraction that represents inference capacity as explicit entitlements expressed in inference-native units (token throughput, KV cache, concurrency). Unlike rate limits, which govern request admission without regard to execution cost, token pools authorize both admission and autoscaling from the same capacity model, ensuring consistency between what is promised and what is provisioned. The abstraction captures burst modes across multiple dimensions invisible to conventional throttling. Dynamic per-entitlement limits on each burst dimension enable fine-grained control over resource consumption while permitting work-conserving backfill by low-priority traffic. The design supports priority-aware allocation, service tiers with differentiated guarantees, and debt-based fairness mechanisms, all without modifying the underlying inference runtime or cluster scheduler. In experiments on a Kubernetes cluster with vLLM backends, token pools maintain a bounded P99 latency for guaranteed workloads during overload by selectively throttling spot traffic, while a baseline without admission control experiences unbounded latency degradation across all workloads. A second experiment demonstrates debt-based fair-share convergence among elastic workloads with heterogeneous SLO requirements during capacity scarcity.
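The contrast between cost-aware entitlements and request-count rate limits can be illustrated with a minimal sketch. It is not the system evaluated here: it models only one dimension (token throughput) over a single accounting window, and it omits KV cache, concurrency, autoscaling, and debt-based fairness. The names `Request`, `admit_window`, `capacity`, and `guarantees` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Request:
    tenant: str
    cost_tokens: int  # assumed estimate of prompt + output tokens


def admit_window(capacity, guarantees, requests):
    """Admit requests for one accounting window (illustrative only).

    capacity:   tokens the pool can serve in this window
    guarantees: tenant -> guaranteed tokens per window (missing = spot)
    Returns the admitted requests, guaranteed traffic first.
    """
    used = {}                 # tokens admitted under each tenant's guarantee
    remaining = capacity      # unallocated pool capacity
    admitted, deferred = [], []

    # Pass 1: admit traffic that fits inside its tenant's entitlement.
    for r in requests:
        guaranteed = guarantees.get(r.tenant, 0)
        so_far = used.get(r.tenant, 0)
        fits_entitlement = so_far + r.cost_tokens <= guaranteed
        if fits_entitlement and r.cost_tokens <= remaining:
            used[r.tenant] = so_far + r.cost_tokens
            remaining -= r.cost_tokens
            admitted.append(r)
        else:
            deferred.append(r)

    # Pass 2: work-conserving backfill. Spot and over-entitlement traffic
    # may consume whatever capacity the guaranteed traffic left unused.
    for r in deferred:
        if r.cost_tokens <= remaining:
            remaining -= r.cost_tokens
            admitted.append(r)

    return admitted


if __name__ == "__main__":
    reqs = [
        Request("spot", 500),
        Request("gold", 400),
        Request("spot", 300),
        Request("gold", 300),
        Request("spot", 200),
    ]
    result = admit_window(1000, {"gold": 600}, reqs)
    print([(r.tenant, r.cost_tokens) for r in result])
    # prints [('gold', 400), ('spot', 500)]
```

In this sketch the guaranteed tenant is served before any spot request regardless of arrival order, and spot traffic is throttled only once the leftover capacity is exhausted. A limiter that counts requests would treat the 500-token and 200-token requests identically.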
