State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based services that provide access to large language models have become very popular. In these services, the price users pay for an output provided by a model depends on the number of tokens the model uses to generate it: they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof of concept, we develop an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion. Crucially, we demonstrate that the cost of running the algorithm is lower than the additional revenue from overcharging users, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, we show that, to eliminate the financial incentive to strategize, a pricing mechanism must price tokens linearly in their character count. While this makes a provider's profit margin vary across tokens, we introduce a simple prescription under which a provider who adopts such an incentive-compatible pricing mechanism can maintain the average profit margin they had under the pay-per-token pricing mechanism. Along the way, to illustrate and complement our theoretical results, we conduct experiments with several large language models from the $\texttt{Llama}$, $\texttt{Gemma}$ and $\texttt{Ministral}$ families, and input prompts from the LMSYS Chatbot Arena platform.
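The incentive gap between the two pricing mechanisms can be sketched in a few lines of Python. This is a toy illustration (not the paper's heuristic algorithm, and the vocabulary, prices, and helper names are invented for the example): the same output string admits multiple valid tokenizations, so under pay-per-token pricing a provider can report a longer tokenization and charge more, whereas a price that is linear in character count is invariant to how the string was split.

```python
# Toy illustration of the incentive gap (hypothetical prices and tokens).

def bill_per_token(tokens, price_per_token=2.0):
    """Pay-per-token: the bill depends on how the output was tokenized."""
    return price_per_token * len(tokens)

def bill_per_character(tokens, price_per_char=0.5):
    """Character-linear pricing: the bill depends only on the string itself."""
    return price_per_char * sum(len(t) for t in tokens)

output = "hello world"
honest = ["hello", " world"]            # 2 tokens: what the model actually used
inflated = ["he", "llo", " wor", "ld"]  # 4 tokens: same string, misreported split

# Both tokenizations reconstruct the identical output the user received.
assert "".join(honest) == "".join(inflated) == output

print(bill_per_token(honest), bill_per_token(inflated))        # 4.0 8.0
print(bill_per_character(honest), bill_per_character(inflated))  # 5.5 5.5
```

Because every tokenization of a given string covers exactly the same characters, the character-linear bill is identical for the honest and inflated reports, which is the intuition behind the incentive-compatibility result stated in the abstract.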