Modern language models are internally -- and mathematically -- distributions over token strings rather than \emph{character} strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before being passed to the token-level language model. Thus, the tokenizer and all downstream analyses are very sensitive to the exact specification of the prompt (e.g., whether or not the prompt ends with a space). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality of our methods. We find that -- even with a small computation budget -- our method accurately approximates the character-level distribution (less than 0.00021 excess bits per character) at reasonably fast speeds (46.3 characters per second) on the Llama 3.1 8B language model.