We consider a single large language model (LLM) server that serves a heterogeneous stream of queries belonging to $N$ distinct task types. Queries arrive according to a Poisson process, and each type occurs with a known prior probability. For each task type, the server allocates a fixed number of internal thinking tokens, which determines the computational effort devoted to queries of that type. The token allocation induces an accuracy-latency trade-off: the service time is approximately affine in the number of allocated tokens, while the probability of a correct response exhibits diminishing returns. Under a first-in, first-out (FIFO) service discipline, the system operates as an $M/G/1$ queue, and the mean system time depends on the first and second moments of the resulting service-time distribution. We formulate a constrained optimization problem that maximizes a weighted average accuracy objective penalized by the mean system time, subject to architectural token-budget constraints and queue-stability conditions. The objective function is shown to be strictly concave over the stability region, which ensures existence and uniqueness of the optimal token allocation. The first-order optimality conditions yield a coupled projected fixed-point characterization of the optimum, together with an iterative solution method and an explicit sufficient condition under which the iteration is a contraction. Moreover, a projected gradient method with a computable global step-size bound is developed to guarantee convergence beyond the contractive regime. Finally, integer-valued token allocations are obtained by rounding the continuous solution, and the resulting performance loss is evaluated in simulation.
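To make the formulation concrete, the following is a minimal sketch of the objective and a projected gradient ascent, under assumptions that the abstract does not state. The service time of a type-$i$ query is taken to be deterministic and exactly affine, $s_i = t_{0,i} + c_i n_i$. The accuracy curve is taken to be $a_i (1 - e^{-b_i n_i})$, chosen only because it is concave with diminishing returns. The token budget is reduced to a box $0 \le n_i \le n_{\max}$. The mean system time uses the standard Pollaczek-Khinchine formula $E[T] = E[S] + \lambda E[S^2] / (2(1 - \rho))$ with $\rho = \lambda E[S]$. A backtracking step size stands in for the computable global step-size bound mentioned above, and all function names, parameter names, and numeric values are hypothetical.

```python
import math

def moments(n, lam, prior, t0, c):
    """Per-type service times and the first two moments of the mixture."""
    s = [t + ci * ni for t, ci, ni in zip(t0, c, n)]  # assumed affine in tokens
    m1 = sum(p * si for p, si in zip(prior, s))        # E[S]
    m2 = sum(p * si ** 2 for p, si in zip(prior, s))   # E[S^2]
    return s, m1, m2, lam * m1                         # last entry is rho

def mean_system_time(n, lam, prior, t0, c):
    """Pollaczek-Khinchine mean system time; infinite if the queue is unstable."""
    _, m1, m2, rho = moments(n, lam, prior, t0, c)
    if rho >= 1.0:
        return math.inf
    return m1 + lam * m2 / (2.0 * (1.0 - rho))

def objective(n, lam, prior, t0, c, a, b, penalty):
    """Prior-weighted accuracy minus the penalized mean system time."""
    acc = sum(p * ai * (1.0 - math.exp(-bi * ni))
              for p, ai, bi, ni in zip(prior, a, b, n))
    return acc - penalty * mean_system_time(n, lam, prior, t0, c)

def gradient(n, lam, prior, t0, c, a, b, penalty):
    """Analytic gradient of the objective at a stable allocation n."""
    s, _, m2, rho = moments(n, lam, prior, t0, c)
    grad = []
    for i in range(len(n)):
        d_acc = prior[i] * a[i] * b[i] * math.exp(-b[i] * n[i])
        d_time = prior[i] * c[i] * (1.0 + lam * s[i] / (1.0 - rho)
                                    + lam ** 2 * m2 / (2.0 * (1.0 - rho) ** 2))
        grad.append(d_acc - penalty * d_time)
    return grad

def projected_gradient_ascent(n0, n_max, params, iters=500, step=100.0):
    """Gradient ascent with projection onto the box [0, n_max]^N.

    Backtracking halves the step until the objective improves; unstable
    candidates evaluate to -inf and are therefore never accepted.
    """
    n, f = list(n0), objective(n0, *params)
    if f == -math.inf:
        raise ValueError("initial allocation must satisfy rho < 1")
    for _ in range(iters):
        g = gradient(n, *params)
        t = step
        while t > 1e-12:
            cand = [min(max(ni + t * gi, 0.0), n_max) for ni, gi in zip(n, g)]
            f_cand = objective(cand, *params)
            if f_cand > f:
                n, f = cand, f_cand
                break
            t *= 0.5
        else:
            break  # no improving step: treat the current point as stationary
    return n, f

# Hypothetical two-type instance: (lam, prior, t0, c, a, b, penalty).
params = (0.5, [0.6, 0.4], [0.2, 0.3], [0.002, 0.004],
          [0.9, 0.95], [0.01, 0.004], 0.05)
n_opt, f_opt = projected_gradient_ascent([0.0, 0.0], 2000.0, params)
n_int = [round(x) for x in n_opt]  # integer allocation obtained by rounding
```

Comparing `objective(n_int, *params)` with `f_opt` gives the rounding loss for this toy instance; it says nothing about the loss reported in the paper's simulations.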