Test-time compute has emerged as a promising strategy to enhance the reasoning abilities of large language models (LLMs). However, this strategy has in turn increased how much users pay cloud-based providers offering LLM-as-a-service, since providers charge users for the amount of test-time compute they use to generate an output. In our work, we show that the market of LLM-as-a-service is socially inefficient: providers have a financial incentive to increase the amount of test-time compute, even if this increase contributes little to the quality of the outputs. To address this inefficiency, we introduce a reverse second-price auction mechanism where providers bid their offered price and (expected) quality for the opportunity to serve a user, and users pay proportionally to the marginal value generated by the winning provider relative to the second-highest bidder. To illustrate and complement our theoretical results, we conduct experiments with multiple instruct models from the $\texttt{Llama}$ and $\texttt{Qwen}$ families, as well as reasoning models distilled from $\texttt{DeepSeek-R1}$, on math and science benchmark datasets.
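As a rough illustration of the auction idea described above, the following is a minimal sketch, not the paper's mechanism. The abstract does not state the exact scoring or payment rule, so the sketch assumes a standard second-score rule: each bid is scored as expected quality minus price, the highest score wins, and the user pays the winner's quality minus the runner-up's score. The names `Bid` and `run_auction`, and the assumption that quality is expressed in the same monetary units as price, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    provider: str
    price: float    # price the provider asks for serving the query
    quality: float  # expected output quality, in the same units as price (assumption)


def run_auction(bids):
    """Pick a winner and a payment under an assumed second-score rule.

    Score = quality - price. The payment leaves the user exactly the
    surplus the runner-up would have offered, so the winner is paid for
    the extra value it provides over the next-best bid.
    """
    if len(bids) < 2:
        raise ValueError("need at least two bids to set a payment")
    ranked = sorted(bids, key=lambda b: b.quality - b.price, reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    runner_up_score = runner_up.quality - runner_up.price
    payment = winner.quality - runner_up_score
    return winner, payment


# Hypothetical bids: scores are A = 8, B = 6, C = 5.
bids = [Bid("A", 2.0, 10.0), Bid("B", 1.0, 7.0), Bid("C", 4.0, 9.0)]
winner, payment = run_auction(bids)
print(winner.provider, payment)  # A 4.0
```

Under this assumed rule the payment never falls below the winner's asked price, because the winner's score is at least the runner-up's. A provider that raises its price without raising quality only lowers its own score, which is the kind of incentive the abstract argues pay-per-compute pricing lacks.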