While there has been a large body of research attempting to circumvent tokenization for language modeling (Clark et al., 2022; Xue et al., 2022), the current consensus is that it is a necessary initial step for designing state-of-the-art performant language models. In this paper, we investigate tokenization from a theoretical point of view by studying the behavior of transformers on simple data-generating processes. When trained on data drawn from certain simple $k^{\text{th}}$-order Markov processes for $k > 1$, transformers exhibit a surprising phenomenon: in the absence of tokenization, they empirically fail to learn the right distribution and instead predict characters according to a unigram model (Makkuva et al., 2024). With the addition of tokenization, however, we empirically observe that transformers break through this barrier and model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. With this observation as a starting point, we study the end-to-end cross-entropy loss achieved by transformers with and without tokenization. We show that with an appropriate choice of tokenizer, even the simplest unigram models (over tokens) learnt by transformers model the probability of sequences drawn from $k^{\text{th}}$-order Markov sources near-optimally. Our analysis justifies the use of tokenization in practice by studying the behavior of transformers on Markovian data.
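The gap between a character-level unigram model and a unigram model over tokens can be made concrete with a small numerical sketch. The snippet below is illustrative and not the paper's code: it assumes an order-1 binary "switch" source (the paper considers order-$k$ variants), where the next bit flips with probability $p$, so the stationary marginal is uniform and a character-level unigram model pays 1 bit/char regardless of $p$, while the true entropy rate is $H(p)$. As the tokenizer we use LZW-style dictionary parsing, one simple tokenizer of the kind studied in this setting; the source parameters `p` and `n` are arbitrary choices for the demonstration.

```python
import math
import random
from collections import Counter

random.seed(0)

# Assumed source parameters (illustrative only).
p = 0.1       # probability the next bit flips
n = 200_000   # number of characters to sample

def sample_switch(n, p):
    """Sample n bits from the order-1 switch source as a 0/1 string."""
    bits = [random.randint(0, 1)]
    for _ in range(n - 1):
        bits.append(bits[-1] ^ (random.random() < p))
    return "".join(map(str, bits))

def lzw_tokenize(s):
    """LZW-style parsing: greedily match the longest phrase already in
    the dictionary, adding each newly seen phrase as we go."""
    dictionary = {"0", "1"}
    tokens, phrase = [], ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            tokens.append(phrase)
            dictionary.add(phrase + ch)
            phrase = ch
    tokens.append(phrase)
    return tokens

text = sample_switch(n, p)

# Entropy rate of the switch source: H(p) bits per character.
entropy_rate = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Best character-level unigram model: cross-entropy equals the
# empirical entropy of the marginal character distribution.
char_counts = Counter(text)
char_unigram = sum(-c * math.log2(c / n) for c in char_counts.values()) / n

# Best unigram model over LZW tokens: empirical token entropy,
# normalized by the number of *characters* for a fair comparison.
tokens = lzw_tokenize(text)
tok_counts = Counter(tokens)
m = len(tokens)
token_unigram = sum(-c * math.log2(c / m) for c in tok_counts.values()) / n

print(f"entropy rate (optimal): {entropy_rate:.3f} bits/char")
print(f"char-level unigram    : {char_unigram:.3f} bits/char")
print(f"unigram over tokens   : {token_unigram:.3f} bits/char")
```

Running this, the character-level unigram model pays close to 1 bit/char while the unigram model over tokens lands much nearer the entropy rate $H(0.1) \approx 0.469$ bits/char, and the gap shrinks as $n$ grows; this mirrors, in miniature, the phenomenon described above.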