Tokenization, the practice of converting strings of characters from an alphabet into sequences of tokens over a vocabulary, is a critical step in the NLP pipeline. The use of token representations is widely credited with improving model performance, but it is also the source of many undesirable behaviors, such as spurious ambiguity and inconsistency. Despite its recognized importance as a standard representation method in NLP, the theoretical underpinnings of tokenization are not yet fully understood. In particular, the impact of tokenization on language model estimation has been investigated primarily through empirical means. The present paper contributes to addressing this theoretical gap by proposing a unified formal framework for representing and analyzing tokenizer models. Based on the category of stochastic maps, this framework enables us to establish general conditions for the principled use of tokenizers and, most importantly, the necessary and sufficient conditions for a tokenizer model to preserve the consistency of statistical estimators. In addition, we discuss statistical and computational concerns crucial for designing and implementing tokenizer models, such as inconsistency, ambiguity, finiteness, and sequentiality. The framework and results advanced in this paper contribute to building robust theoretical foundations for representations in neural language modeling that can inform future theoretical and empirical research.
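To make the notion of spurious ambiguity concrete, the following is a minimal illustrative sketch (not taken from the paper; the toy vocabulary and the `tokenizations` helper are hypothetical): it enumerates every segmentation of a string over a subword vocabulary, showing how several distinct token sequences can decode to the same character string.

```python
# Minimal sketch of spurious ambiguity: a string over an alphabet can admit
# multiple token sequences over a vocabulary, all decoding to the same string.

def tokenizations(s: str, vocab: set[str]) -> list[list[str]]:
    """Return every segmentation of `s` into tokens drawn from `vocab`."""
    if not s:
        return [[]]  # the empty string has exactly one (empty) segmentation
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in vocab:
            # Recurse on the remainder and prepend the matched token.
            for rest in tokenizations(s[i:], vocab):
                results.append([prefix] + rest)
    return results

# A toy vocabulary under which "unable" is ambiguous.
vocab = {"u", "n", "a", "b", "l", "e", "un", "able", "unable"}
for seq in tokenizations("unable", vocab):
    print(seq)
# e.g. ['unable'], ['un', 'able'], ['u', 'n', 'a', 'b', 'l', 'e'], ...
```

Under a tokenized language model, each of these token sequences may receive its own probability mass even though they all represent the same string, which is one of the estimation concerns the framework is meant to analyze.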