Tokenization is widely used in large language models because it significantly improves performance. However, tokenization also has several disadvantages, such as performance biases, increased adversarial vulnerability, decreased character-level modeling performance, and increased modeling complexity. To address these disadvantages without sacrificing performance, we propose SpaceByte, a novel byte-level decoder architecture that closes the performance gap between byte-level and subword autoregressive language modeling. SpaceByte consists of a byte-level Transformer model with additional, larger Transformer blocks inserted in the middle of its layers. We find that performance is significantly improved by applying these larger blocks only after certain bytes, such as space characters, which typically denote word boundaries. Our experiments show that for a fixed training and inference compute budget, SpaceByte outperforms other byte-level architectures and roughly matches the performance of tokenized Transformer architectures.
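As a rough illustration of the idea of applying the larger blocks only after certain bytes, the sketch below marks the byte positions that immediately follow a space character. The function name `global_block_positions`, the parameter `boundary_bytes`, and the choice of the ASCII space as the only boundary byte are assumptions made for this example; the abstract says only "certain bytes, such as space characters", so the actual selection rule may differ.

```python
def global_block_positions(data: bytes, boundary_bytes: bytes = b" ") -> list[int]:
    """Return the indices of bytes that directly follow a boundary byte.

    Hypothetical helper: in this sketch, these are the positions where the
    larger Transformer blocks would be applied. Every other position would be
    handled only by the smaller byte-level blocks.
    """
    return [
        i
        for i in range(1, len(data))
        if data[i - 1] in boundary_bytes  # previous byte is a boundary (assumed: space)
    ]


text = "the cat sat".encode("utf-8")
positions = global_block_positions(text)
print(positions)  # [4, 8]

# Fraction of positions that would receive the larger blocks in this sketch.
print(len(positions) / len(text))
```

Under these assumptions the larger blocks would run roughly once per word rather than once per byte, which is one way to read the abstract's claim about matching tokenized models at a fixed compute budget.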