Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic that combines characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the limitations of this tokenization strategy, particularly for documents not written in English and for representing numbers. At the other extreme, byte/character-level language models are much less restricted but suffer from increased sequence description lengths and a resulting quadratic growth in self-attention computation. Recent attempts to compress and limit these context lengths with fixed-size convolutions are helpful but completely ignore the word boundary. This paper considers an alternative 'learn your tokens' scheme that uses the word boundary to pool bytes/characters into word representations, which are fed to the primary language model before individual characters/bytes are decoded again per word in parallel. We find that our moderately expressive and moderately fast end-to-end tokenizer outperforms both subword and byte/character models by over 300% on the intrinsic language modeling metric of next-word prediction across datasets. It does especially well on rare words, outperforming by a factor of 30! We extensively study the language modeling setup for all three categories of tokenizers and theoretically analyze how our end-to-end models can also offer a strong trade-off in efficiency and robustness.
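The pooling idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the whitespace word boundary, the random embedding table standing in for learned byte embeddings, the mean pooling (the paper's actual pooling mechanism is not specified in the abstract), and all function names. The per-word decoding step is omitted. The sketch only shows how pooling bytes into words shortens the sequence seen by the main model, and how that affects the quadratic self-attention cost.

```python
import random

EMBED_DIM = 8  # hypothetical embedding width, chosen only for illustration


def make_byte_embeddings(dim=EMBED_DIM, seed=0):
    """Random stand-in for a learned byte-embedding table (256 rows)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(256)]


def split_into_word_bytes(text):
    """Use whitespace as the word boundary; return one byte list per word."""
    return [list(word.encode("utf-8")) for word in text.split()]


def pool_word(byte_ids, table):
    """Mean-pool the byte embeddings of one word into a single vector.

    Mean pooling is an assumption; it is not claimed to be the paper's method.
    """
    dim = len(table[0])
    pooled = [0.0] * dim
    for b in byte_ids:
        for j in range(dim):
            pooled[j] += table[b][j]
    return [v / len(byte_ids) for v in pooled]


def encode_words(text, table):
    """Map text to one pooled vector per word (the input to the main model)."""
    return [pool_word(ids, table) for ids in split_into_word_bytes(text)]


def attention_pairs(seq_len):
    """Query-key pairs in full self-attention: quadratic in sequence length."""
    return seq_len * seq_len


text = "tokenization of rare words"
table = make_byte_embeddings()
word_vectors = encode_words(text, table)

num_bytes = len(text.encode("utf-8"))  # sequence length for a byte-level model
num_words = len(word_vectors)          # sequence length after word pooling

print(num_bytes, num_words)
print(attention_pairs(num_bytes), attention_pairs(num_words))
```

For this 26-byte, 4-word input, full self-attention over bytes involves 676 query-key pairs, while attention over pooled word vectors involves 16. This is the kind of sequence-length saving the word-boundary scheme targets.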