Inference-time scaling has emerged as a critical avenue for enhancing Large Language Models' performance, yet real-world deployment is constrained by strict computational budgets. In this work, we formulate inference budget allocation as a global constrained optimization problem governed by economic principles. By modeling per-query reasoning utility with a shifted-surge function, we derive an optimal allocation policy based on a global shadow price that equilibrates marginal utility under resource scarcity. Based on this theory, we propose Constrained Latent-utility Equilibrium Allocation for Reasoning (CLEAR). It performs rational abandonment and reallocates resources from insolvent queries to solvable queries near their emergence thresholds. Extensive experiments on several reasoning tasks with different traffic streams demonstrate that CLEAR significantly improves the Pareto frontier of total token cost versus mean accuracy. In resource-scarce regimes, CLEAR achieves up to a 3x improvement in global accuracy compared to uniform allocation.
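The shadow-price idea in the abstract can be illustrated with a small sketch. This is not the CLEAR implementation: the abstract does not give the form of the shifted-surge utility, so a shifted logistic curve stands in for it here, and the names `utility`, `best_budget`, `allocate`, the discrete budget grid, and the bisection search on the price are all assumptions made for illustration. Each query picks the token budget that maximizes its utility minus price times tokens, a query whose best choice is zero tokens is abandoned, and the single global price is raised until total spend fits the budget.

```python
import math


def utility(b, threshold, ceiling, sharpness=0.02):
    """Assumed stand-in for per-query utility: a logistic curve that
    rises sharply once the token budget b passes `threshold`."""
    return ceiling / (1.0 + math.exp(-sharpness * (b - threshold)))


def best_budget(price, threshold, ceiling, grid):
    """Budget on the grid maximizing utility minus price * tokens.
    The grid starts at 0 and ties go to the smaller budget, so a query
    that is not worth its cost at this price gets 0 tokens (abandoned)."""
    return max(grid, key=lambda b: utility(b, threshold, ceiling) - price * b)


def allocate(queries, total_budget, grid, iters=50):
    """Bisect on a single global price until total spend fits the budget.
    `queries` is a list of (threshold, ceiling) pairs (hypothetical inputs)."""
    lo, hi = 0.0, 1.0  # at price 1.0 per token every query chooses 0 tokens
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        spend = sum(best_budget(mid, t, c, grid) for t, c in queries)
        if spend > total_budget:
            lo = mid  # over budget: the price must rise
        else:
            hi = mid  # feasible: try a lower price
    return hi, [best_budget(hi, t, c, grid) for t, c in queries]


# Three hypothetical queries: easy, moderate, and one whose threshold
# lies beyond the largest budget on the grid.
queries = [(200, 0.9), (800, 0.9), (6000, 0.9)]
grid = range(0, 4097, 64)
price, alloc = allocate(queries, total_budget=1500, grid=grid)
print(price, alloc)
```

In this toy setting the third query receives no tokens because its threshold is out of reach at any positive price, while the first two are funded past their thresholds until their marginal gains per token fall to roughly the same price.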