State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, the conditioning string must be encoded into a list of tokens before it is passed to the language model for next-token prediction. We show that, for encoding schemes such as maximum prefix matching, tokenization induces a sampling bias that cannot be mitigated with more training or data. To counter this universal problem, we propose a novel algorithm that obtains unbiased estimates from a model trained on tokenized data. Our method does not require finetuning the model, and its complexity, defined as the number of model runs, scales linearly with the sequence length. As a consequence, we show that one can simulate token-free behavior from a tokenized language model. We empirically verify the correctness of our method in a Markov-chain setup, where it accurately recovers the transition probabilities, whereas the conventional method of directly prompting the language model with tokens does not.
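The bias described above can be illustrated with a small simulation. The sketch below is not the proposed algorithm; it only shows the problem on a toy example. It assumes a two-state Markov chain over the characters `A` and `B`, a hypothetical vocabulary `{"AA", "A", "B"}`, and illustrative transition probabilities (`p_stay_a`, `p_stay_b`) chosen here for demonstration. Under maximum prefix matching, the token `A` is emitted only when the next character is not `A`, so token-level statistics conditioned on the token `A` assign zero probability to a following `A`, even though the character-level transition probability is clearly nonzero.

```python
import random
from collections import Counter

# Hypothetical toy vocabulary (an assumption for illustration only).
VOCAB = ["AA", "A", "B"]


def encode(text, vocab=VOCAB):
    """Maximum prefix matching: repeatedly take the longest vocab entry
    that is a prefix of the remaining text."""
    by_length = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        match = next(t for t in by_length if text.startswith(t, i))
        tokens.append(match)
        i += len(match)
    return tokens


def sample_chain(n, p_stay_a=0.6, p_stay_b=0.7, seed=0):
    """Sample n characters from a two-state Markov chain over 'A' and 'B'.
    The stay probabilities are illustrative assumptions."""
    rng = random.Random(seed)
    state, out = "A", []
    for _ in range(n):
        out.append(state)
        stay = p_stay_a if state == "A" else p_stay_b
        if rng.random() >= stay:
            state = "B" if state == "A" else "A"
    return "".join(out)


text = sample_chain(20000)
tokens = encode(text)

# Character-level estimate of P(next char = 'A' | current char = 'A').
char_after_a = Counter(b for a, b in zip(text, text[1:]) if a == "A")
p_char = char_after_a["A"] / sum(char_after_a.values())

# Token-level estimate of P(next token starts with 'A' | current token = 'A').
tok_after_a = Counter(b for a, b in zip(tokens, tokens[1:]) if a == "A")
p_tok = sum(c for t, c in tok_after_a.items() if t.startswith("A")) / sum(
    tok_after_a.values()
)

print(f"character-level P(A | A)         ~ {p_char:.3f}")
print(f"token-level P(A... | token 'A')  = {p_tok:.3f}")
```

A model trained on the tokenized stream would learn the token-level statistics, so prompting it with the single token `A` would never yield a continuation starting with `A`, regardless of how much data it has seen. Recovering the character-level transition probability from such a model is the gap the abstract's method is meant to close.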