Language models (LMs) are trained over sequences of tokens, whereas users interact with LMs via text. This mismatch gives rise to the partial token problem, which occurs when a user ends their prompt in the middle of the expected next token, leading to distorted next-token predictions. Although this issue has been studied using arbitrary character prefixes, its prevalence and severity in realistic prompts that respect word boundaries remain underexplored. In this work, we identify three domains where token and "word" boundaries often do not line up: languages that do not use whitespace, highly compounding languages, and code. In Chinese, for example, up to 25% of word boundaries do not coincide with token boundaries, making even natural, word-complete prompts susceptible to this problem. We systematically construct semantically natural prompts ending with partial tokens; in experiments, we find that they constitute a serious failure mode: frontier LMs consistently place three orders of magnitude less probability on the correct continuation than when the prompt is "backed off" to be token-aligned. This degradation does not diminish with scale and often worsens for larger models. Finally, we evaluate inference-time mitigations to the partial token problem and validate the effectiveness of recent exact solutions. Overall, we demonstrate the scale and severity of probability distortion caused by tokenization in realistic use cases, and provide practical recommendations for model inference providers.
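The mismatch described above can be sketched with a toy greedy longest-match tokenizer (a simplification of BPE merging); the vocabulary and the example word are illustrative assumptions, not drawn from the paper's experiments:

```python
# Hypothetical toy vocabulary; real BPE vocabularies are learned from data.
VOCAB = {"inter", "national", "natio", "n", "a", "t", "i", "o", "l", "e", "r"}

def tokenize(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest substring first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"untokenizable character: {text[i]!r}")
    return tokens

# The full word and its word-complete-looking prefix tokenize differently:
print(tokenize("international"))  # ['inter', 'national']
print(tokenize("internatio"))     # ['inter', 'natio']
```

A model trained on text containing `['inter', 'national']` rarely, if ever, sees the sequence `['inter', 'natio']` during training, so a prompt ending at "internatio" yields a distorted distribution over continuations, even though the prompt looks innocuous as text.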