Pretokenization is a crucial, inherently sequential pass in byte-level BPE tokenizers. Our proposed implementation, Peek2, is a drop-in replacement for the cl100k-like pretokenizers used in GPT-3, LLaMa-3, and Qwen-2.5. Designed with performance and safety in mind, Peek2 is regex-free and delivers a $1.11\times$ improvement in end-to-end throughput for the full byte-level BPE encoding process. The algorithm runs entirely on the CPU, has stable linear complexity $O(n)$, and produces presegmentation results identical to those of the original regex-based pretokenizer.
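For context, the regex-based baseline that Peek2 replaces segments text with a cl100k-style pattern before any BPE merges run. The following is a minimal, ASCII-only sketch of such a pretokenizer; the real cl100k pattern relies on Unicode categories (`\p{L}`, `\p{N}`) and the third-party `regex` package, so the simplified pattern below is an illustrative assumption, not the production pattern.

```python
import re

# Simplified, ASCII-only approximation of a cl100k-style pretokenizer
# pattern (the real one uses Unicode property classes and needs the
# third-party `regex` package; this sketch is illustrative only).
CL100K_LIKE = re.compile(
    r"'(?:[sdmt]|ll|ve|re)"   # common English contraction suffixes
    r"| ?[a-zA-Z]+"           # optional leading space + run of letters
    r"| ?[0-9]{1,3}"          # digits, split into chunks of at most 3
    r"| ?[^\sa-zA-Z0-9]+"     # optional leading space + punctuation run
    r"|\s+"                   # any remaining whitespace
)

def pretokenize(text: str) -> list[str]:
    """Split text into presegments; BPE merges then run within each."""
    return CL100K_LIKE.findall(text)

print(pretokenize("Hello, world! It's 2024."))
# → ['Hello', ',', ' world', '!', ' It', "'s", ' 202', '4', '.']
```

Note that the segmentation is lossless: concatenating the presegments reproduces the input exactly, which is the invariant any drop-in replacement such as Peek2 must preserve.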