We propose a novel algorithm for incremental Byte Pair Encoding (BPE) tokenization. The algorithm processes each input byte in worst-case $\mathcal{O}(\log^2 t)$ time, leading to an overall complexity of $\mathcal{O}(n \log^2 t)$, where $n$ is the input length and $t$ is the maximum token length. The algorithm incrementally maintains BPE tokenization results for every prefix of the input text, implementing the standard BPE merge procedure defined by a fixed set of merge rules. This enables efficient partial tokenization in streaming settings. Functioning as a drop-in replacement for standard BPE, our approach achieves a speedup of up to ${\sim}3\times$ over Hugging Face's tokenizers, and demonstrates significant latency reductions over OpenAI's tiktoken on pathological inputs. We further introduce an eager output algorithm that enables streaming output, emitting tokens as soon as token boundaries are determined during incremental tokenization. Overall, our results demonstrate that BPE tokenization can be performed incrementally with strong worst-case guarantees, while providing practical latency benefits in modern large language model pipelines. Code: https://github.com/ModelTC/mtc-inc-bpe
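To make concrete what "maintaining BPE tokenization results for every prefix" means, the following is a minimal reference sketch of the standard BPE merge procedure, re-run from scratch on each prefix. It is a naive baseline for illustration only, not the incremental algorithm proposed here, and it has none of the stated complexity guarantees. The names `bpe_encode` and `prefix_tokenizations`, and the toy merge table, are hypothetical and are not taken from the linked repository.

```python
from typing import Dict, List, Tuple

# Merge table (illustrative assumption): adjacent token pair -> rank.
# A lower rank means the rule was learned earlier and is applied first.
Merges = Dict[Tuple[bytes, bytes], int]


def bpe_encode(data: bytes, merges: Merges) -> List[bytes]:
    """Naive standard BPE: repeatedly merge the lowest-ranked adjacent pair."""
    tokens = [bytes([b]) for b in data]  # start from single bytes
    while len(tokens) > 1:
        best_rank, best_i = None, -1
        for i in range(len(tokens) - 1):
            rank = merges.get((tokens[i], tokens[i + 1]))
            # Strict "<" keeps the leftmost pair when ranks tie.
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_rank is None:
            break  # no applicable merge rule remains
        tokens[best_i:best_i + 2] = [tokens[best_i] + tokens[best_i + 1]]
    return tokens


def prefix_tokenizations(data: bytes, merges: Merges) -> List[List[bytes]]:
    """Tokenize every non-empty prefix by re-running BPE from scratch."""
    return [bpe_encode(data[:i], merges) for i in range(1, len(data) + 1)]


if __name__ == "__main__":
    toy_merges: Merges = {(b"b", b"c"): 0, (b"a", b"b"): 1}
    for prefix_tokens in prefix_tokenizations(b"abc", toy_merges):
        print(prefix_tokens)
```

With the toy table above, the prefix `ab` tokenizes to `[b"ab"]`, but appending `c` changes the result to `[b"a", b"bc"]`, because the `(b, c)` rule has a lower rank. This shows why a streaming tokenizer can only emit a token once its boundary can no longer be changed by later bytes, which is the situation an eager output scheme has to handle.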