As large language models move toward million-token context windows, CPU tokenizers become a serious bottleneck: they process text sequentially while powerful GPUs sit idle. We built a GPU-based byte-level BPE tokenizer that follows GPT-2's merge rules. It includes a baseline BlockBPE-style kernel and a faster, optimized version that uses a cuCollections static map and CUB reductions, exposed to Python through a pybind11 interface. On WikiText-103 sequences of up to 131k tokens, the optimized GPU tokenizer produces output identical to a CPU reference implementation and, on the longest inputs, runs about 1.7x faster than tiktoken and about 7.6x faster than the HuggingFace GPT-2 tokenizer. Nsight profiling shows that 70-80% of CUDA API time is spent on memory allocation, so memory pooling is the most promising next optimization. On generation tasks with WikiText-103 prompts, our GPU tokenizer's outputs stay within about one percentage point of tiktoken and HuggingFace GPT-2 on similarity and overlap metrics, indicating that it preserves output quality while making long-context inference more practical.
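To make the block-parallel merge step concrete, below is a minimal CUDA sketch of one BlockBPE-style merge round, not the paper's exact kernel: each thread in a block scores a strided subset of adjacent token pairs, a CUB block reduction selects the lowest-rank pair, and thread 0 applies the merge. All names (`bpe_merge_round`, `merge_keys`, `merged_ids`, ...) are hypothetical, and a device-side binary search over a sorted pair table stands in for the cuCollections static map used in the optimized version.

```cuda
// Hypothetical sketch: one BPE merge round for a single sequence, executed
// by one thread block. The host (or an outer loop) repeats this round until
// no adjacent pair has an entry in the merge table.
#include <cub/cub.cuh>
#include <cstdint>

constexpr int BLOCK_THREADS = 256;
constexpr int32_t NO_RANK = INT32_MAX;  // sentinel: pair is not mergeable

// Pack an adjacent token pair into one 64-bit lookup key.
__device__ __forceinline__ uint64_t pack_pair(int32_t l, int32_t r) {
  return (static_cast<uint64_t>(static_cast<uint32_t>(l)) << 32) |
         static_cast<uint32_t>(r);
}

// Binary-search the sorted merge table for a pair's merge rank
// (stand-in for a cuCollections static-map lookup).
__device__ int32_t lookup_rank(const uint64_t* keys, const int32_t* ranks,
                               int num_merges, uint64_t key) {
  int lo = 0, hi = num_merges;
  while (lo < hi) {
    int mid = lo + (hi - lo) / 2;
    if (keys[mid] < key) lo = mid + 1; else hi = mid;
  }
  return (lo < num_merges && keys[lo] == key) ? ranks[lo] : NO_RANK;
}

// One merge round: find the lowest-rank adjacent pair and merge it in place.
__global__ void bpe_merge_round(int32_t* tokens, int* len,
                                const uint64_t* merge_keys,
                                const int32_t* merge_ranks,
                                const int32_t* merged_ids,  // new id per rank
                                int num_merges) {
  using BlockReduce = cub::BlockReduce<int64_t, BLOCK_THREADS>;
  __shared__ typename BlockReduce::TempStorage temp;
  __shared__ int64_t best;  // packed as (rank << 32) | position

  int n = *len;
  int64_t local = static_cast<int64_t>(NO_RANK) << 32;  // "no pair found"
  for (int i = threadIdx.x; i + 1 < n; i += BLOCK_THREADS) {
    int32_t r = lookup_rank(merge_keys, merge_ranks, num_merges,
                            pack_pair(tokens[i], tokens[i + 1]));
    int64_t cand = (static_cast<int64_t>(r) << 32) | static_cast<uint32_t>(i);
    if (cand < local) local = cand;
  }
  // CUB block reduction: minimum over (rank, position); result is valid
  // only in thread 0, so broadcast it through shared memory.
  int64_t reduced = BlockReduce(temp).Reduce(local, cub::Min());
  if (threadIdx.x == 0) best = reduced;
  __syncthreads();

  if (static_cast<int32_t>(best >> 32) == NO_RANK) return;  // nothing to merge

  if (threadIdx.x == 0) {
    int pos = static_cast<int>(best & 0xffffffffLL);
    int32_t rank = static_cast<int32_t>(best >> 32);
    tokens[pos] = merged_ids[rank];        // replace the pair's left token
    for (int i = pos + 1; i + 1 < n; ++i)  // close the gap (serial for clarity)
      tokens[i] = tokens[i + 1];
    *len = n - 1;
  }
}
```

Merging one pair per round and shifting the tail serially keeps the sketch readable; an optimized kernel would amortize this (for example, merging non-overlapping minimum-rank pairs in a pass and compacting with a scan) and replace the binary search with an O(1) hash-map probe.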