Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare words and require large embedding matrices. Character-level models address these issues but introduce performance bottlenecks, particularly in Transformer-based architectures. Recent hierarchical models attempt to merge the benefits of both paradigms by grouping characters into patches, but existing patching strategies either rely on whitespace, limiting applicability to certain languages, or require auxiliary models that introduce new dependencies. In this paper, we propose a dynamic character grouping method that leverages the structure of existing BPE tokenization without requiring additional models. By appending explicit end-of-patch markers to BPE tokens and introducing a second-level BPE compression stage to control patch granularity, our method offers efficient, flexible, and language-agnostic representations. Empirical results demonstrate that our approach matches or exceeds the performance of dynamic entropy- and whitespace-based patching strategies, while maintaining a compact vocabulary.
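The grouping idea described above can be illustrated with a toy sketch. This is not the paper's implementation: the `EOP` marker string, the pre-tokenized input, and the greedy merge heuristic standing in for the second-level BPE compression stage are all illustrative assumptions.

```python
# Toy sketch of BPE-boundary-based character patching.
# Assumptions (not from the paper): the EOP marker string, the input
# already being BPE-tokenized, and a greedy length-based merge as a
# stand-in for a learned second-level BPE compression stage.

EOP = "</p>"  # explicit end-of-patch marker appended to each BPE token

def tokens_to_patches(bpe_tokens):
    """Turn each BPE token into a character-level patch ending in EOP."""
    return [list(tok) + [EOP] for tok in bpe_tokens]

def merge_small_patches(patches, max_len=6):
    """Greedily merge adjacent short patches to coarsen granularity
    (illustrative stand-in for a second BPE pass over patches)."""
    merged, i = [], 0
    while i < len(patches):
        if i + 1 < len(patches) and len(patches[i]) + len(patches[i + 1]) <= max_len:
            # drop the inner EOP so the merged patch ends in a single marker
            merged.append(patches[i][:-1] + patches[i + 1])
            i += 2
        else:
            merged.append(patches[i])
            i += 1
    return merged

patches = tokens_to_patches(["un", "believ", "able"])
coarse = merge_small_patches(patches)
```

Each patch is thus delimited by the token structure BPE already learned, so no auxiliary segmentation model or whitespace heuristic is needed; the second stage only adjusts how coarse the patches are.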