While most frontier models still rely on deterministic, frequency-based tokenization algorithms such as byte-pair encoding (BPE), there has been significant recent work on learned neural tokenizers. However, these schemes generally increase the complexity of the underlying language model and require substantial architectural changes, making them difficult to implement at scale. To overcome these challenges, we propose the gated quantized variational autoencoder (GQ-VAE), a novel architecture that can be independently pre-trained to serve as a drop-in replacement for existing tokenizers. The key innovation of the architecture is learning to encode variable-length discrete tokens. GQ-VAE improves compression and language modeling performance over a standard VQ-VAE tokenizer, and approaches the compression rate and language modeling performance of BPE. Interestingly, when BPE is given a smaller vocabulary so that its compression rate matches GQ-VAE's, GQ-VAE improves downstream language model learning. We conclude with a discussion of several exciting avenues for future work. Code can be found at https://github.com/Theo-Datta-115/gq-vae.
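The abstract does not spell out the gating mechanism, so as a rough illustration of what "learning to encode variable-length discrete tokens" could look like, here is a minimal PyTorch sketch of one plausible design: a standard VQ-VAE quantizer paired with a per-position gate that decides whether each code is emitted. The class name `GatedQuantizer`, the thresholded gate, and all hyperparameters are illustrative assumptions, not the paper's implementation (see the repository above for the actual code).

```python
import torch
import torch.nn as nn

class GatedQuantizer(nn.Module):
    """Hypothetical gated vector quantizer: a VQ-VAE codebook lookup plus a
    per-position gate that decides whether a code is emitted, giving a
    variable-length discrete output. Not the paper's actual architecture."""
    def __init__(self, codebook_size=1024, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.gate = nn.Linear(dim, 1)  # scalar emit/skip logit per position

    def forward(self, h):
        # h: (batch, seq, dim) continuous encoder states (e.g. over bytes).
        w = self.codebook.weight                         # (K, dim)
        # Squared Euclidean distance to every codebook entry, as in VQ-VAE.
        d = (h.pow(2).sum(-1, keepdim=True)
             - 2 * h @ w.t()
             + w.pow(2).sum(-1))                         # (batch, seq, K)
        ids = d.argmin(-1)                               # nearest discrete codes
        q = self.codebook(ids)                           # quantized vectors
        q = h + (q - h).detach()                         # straight-through estimator
        emit = torch.sigmoid(self.gate(h)).squeeze(-1)   # emit probability per position
        return ids, q, emit

# Keeping only gated positions yields a ragged, variable-length token stream.
quantizer = GatedQuantizer()
h = torch.randn(2, 16, 128)                              # dummy encoder output
ids, q, emit = quantizer(h)
tokens = [ids[b, emit[b] > 0.5] for b in range(h.size(0))]
print([t.numel() for t in tokens])                       # lengths differ per sequence
```

At training time the hard 0.5 threshold would need a differentiable relaxation (for example, a straight-through trick on the gate), but that detail is beyond this sketch.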