Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare words and require large embedding matrices. Character-level models address these issues but introduce performance bottlenecks, particularly in Transformer-based architectures. Recent hierarchical models attempt to merge the benefits of both paradigms by grouping characters into patches, but existing patching strategies either rely on whitespace, limiting applicability to certain languages, or require auxiliary models that introduce new dependencies. In this paper, we propose a dynamic character grouping method that leverages the structure of existing BPE tokenization without requiring additional models. By appending explicit end-of-patch markers to BPE tokens and introducing a second-level BPE compression stage to control patch granularity, our method offers efficient, flexible, and language-agnostic representations. Empirical results demonstrate that our approach matches or exceeds the performance of dynamic entropy- and whitespace-based patching strategies, while maintaining a compact vocabulary.
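The grouping idea described above can be illustrated with a toy sketch. This is not the paper's implementation: the `EOP` marker string, the pre-tokenized input, and the greedy merge heuristic standing in for the second-level BPE compression stage are all illustrative assumptions.

```python
# Toy sketch of BPE-boundary-based character patching.
# Assumptions (not from the paper): the EOP marker string, the input
# already being BPE-tokenized, and a greedy length-based merge as a
# stand-in for a learned second-level BPE compression stage.

EOP = "</p>"  # explicit end-of-patch marker appended to each BPE token

def tokens_to_patches(bpe_tokens):
    """Turn each BPE token into a character-level patch ending in EOP."""
    return [list(tok) + [EOP] for tok in bpe_tokens]

def merge_small_patches(patches, max_len=6):
    """Greedily merge adjacent short patches to coarsen granularity
    (illustrative stand-in for a second BPE pass over patches)."""
    merged, i = [], 0
    while i < len(patches):
        if i + 1 < len(patches) and len(patches[i]) + len(patches[i + 1]) <= max_len:
            # drop the inner EOP so the merged patch ends in a single marker
            merged.append(patches[i][:-1] + patches[i + 1])
            i += 2
        else:
            merged.append(patches[i])
            i += 1
    return merged

patches = tokens_to_patches(["un", "believ", "able"])
coarse = merge_small_patches(patches)
```

Each patch is thus delimited by the token structure BPE already learned, so no auxiliary segmentation model or whitespace heuristic is needed; the second stage only adjusts how coarse the patches are.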