Providers of LLM-as-a-service have predominantly adopted a simple pricing model: users pay a fixed price per token. Consequently, one might assume that two different users would pay the same price for the same output string under the same input prompt. In our work, we show that, surprisingly, this is not (always) true. We find empirical evidence that, particularly for non-English outputs, both proprietary and open-weights LLMs often generate the same output string with multiple different tokenizations, even under the same input prompt, which in turn leads to arbitrary price variation. To address this problem of tokenization multiplicity, we introduce canonical generation, a type of constrained generation that restricts LLMs to generate only canonical tokenizations -- the unique tokenization each string receives during an LLM's training. Further, we introduce an efficient sampling algorithm for canonical generation based on the Gumbel-Max trick. Experiments on a variety of natural language tasks demonstrate that our sampling algorithm for canonical generation is comparable to standard sampling in terms of performance and runtime, and that it solves the problem of tokenization multiplicity.
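The two ingredients above can be illustrated with a minimal toy sketch; it is not the paper's implementation. The greedy longest-match tokenizer, the tiny vocabulary, and the function names (`canonical_tokenize`, `canonical_step`) are all hypothetical stand-ins: a real LLM would use its own BPE tokenizer and full vocabulary. The sketch restricts each decoding step to tokens that keep the generated token sequence canonical, then samples among the allowed tokens with the Gumbel-Max trick (argmax of logits plus i.i.d. Gumbel(0, 1) noise, which is an exact sample from the softmax distribution).

```python
import math
import random

# Toy vocabulary and a greedy longest-match tokenizer standing in for a real
# BPE tokenizer; here, the "canonical" tokenization of a string is by
# definition the one this tokenizer produces.
VOCAB = ["a", "b", "ab", "ba", "abb"]

def canonical_tokenize(s):
    """Greedy longest-match tokenization: the unique canonical split of s."""
    tokens, i = [], 0
    while i < len(s):
        for tok in sorted(VOCAB, key=len, reverse=True):
            if s.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"untokenizable suffix: {s[i:]!r}")
    return tokens

def gumbel_max_sample(logits, rng):
    """Gumbel-Max trick: perturb each logit with independent Gumbel(0, 1)
    noise and take the argmax; the result is an exact sample from
    softmax(logits)."""
    noisy = {t: x - math.log(-math.log(rng.random())) for t, x in logits.items()}
    return max(noisy, key=noisy.get)

def canonical_step(prefix_tokens, logits, rng):
    """Sample the next token, restricted to choices that keep the growing
    token sequence equal to the canonical tokenization of its string."""
    allowed = {}
    for tok, logit in logits.items():
        candidate = prefix_tokens + [tok]
        if canonical_tokenize("".join(candidate)) == candidate:
            allowed[tok] = logit
    return gumbel_max_sample(allowed, rng)
```

For example, after the prefix `["a"]`, the token `"b"` is masked out regardless of its logit, because `"a" + "b"` spells `"ab"`, whose canonical tokenization is the single token `["ab"]`, not `["a", "b"]`; standard sampling could emit either split of the same string, which is exactly the multiplicity the abstract describes.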