Tokenization is widely used in large language models because it significantly improves performance. However, tokenization imposes several disadvantages, such as performance biases, increased adversarial vulnerability, decreased character-level modeling performance, and increased modeling complexity. To address these disadvantages without sacrificing performance, we propose SpaceByte, a novel byte-level decoder architecture that closes the performance gap between byte-level and subword autoregressive language modeling. SpaceByte consists of a byte-level Transformer model, but with additional, larger Transformer blocks inserted in the middle of the layers. We find that performance is significantly improved by applying these larger blocks only after certain bytes, such as space characters, which typically denote word boundaries. Our experiments show that for a fixed training and inference compute budget, SpaceByte outperforms other byte-level architectures and roughly matches the performance of tokenized Transformer architectures.
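To make the architecture concrete, the following is a minimal PyTorch sketch of the idea described above, not the authors' implementation: all module and parameter names (`SpaceByteSketch`, `local_in`, `global_blocks`, `up`, `down`) are hypothetical, and the boundary rule is simplified to the space byte alone, whereas the abstract says only "certain bytes, such as space characters".

```python
import torch
import torch.nn as nn


class SpaceByteSketch(nn.Module):
    """Minimal sketch of the SpaceByte idea (hypothetical names, not the
    authors' code). Small byte-level blocks run at every position; larger
    "global" blocks run only at positions that follow boundary bytes such
    as spaces, and their outputs are merged back at those positions."""

    def __init__(self, d_local=256, d_global=768, n_local=2, n_global=2, heads=8):
        super().__init__()
        mk = lambda d: nn.TransformerEncoderLayer(
            d_model=d, nhead=heads, dim_feedforward=4 * d,
            batch_first=True, norm_first=True)
        self.embed = nn.Embedding(256, d_local)          # one embedding per byte value
        self.local_in = nn.ModuleList(mk(d_local) for _ in range(n_local))
        self.global_blocks = nn.ModuleList(mk(d_global) for _ in range(n_global))
        self.local_out = nn.ModuleList(mk(d_local) for _ in range(n_local))
        self.up = nn.Linear(d_local, d_global)           # widen into the global blocks
        self.down = nn.Linear(d_global, d_local)         # project back to byte width
        self.head = nn.Linear(d_local, 256)              # next-byte logits

    def forward(self, byte_ids):
        # byte_ids: (1, seq) of values 0..255; batch size 1 for simplicity.
        seq = byte_ids.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(seq)
        x = self.embed(byte_ids)
        for blk in self.local_in:
            x = blk(x, src_mask=causal)

        # Simplified boundary rule: the larger blocks run only at positions
        # immediately after a space byte (i.e., roughly once per word).
        idx = (byte_ids[0, :-1] == ord(" ")).nonzero(as_tuple=True)[0] + 1
        if idx.numel() > 0:
            g = self.up(x[:, idx])                       # (1, n_boundaries, d_global)
            g_mask = nn.Transformer.generate_square_subsequent_mask(idx.numel())
            for blk in self.global_blocks:
                g = blk(g, src_mask=g_mask)
            # Add the global features back at the boundary positions only.
            x = x.index_add(1, idx, self.down(g))

        for blk in self.local_out:
            x = blk(x, src_mask=causal)
        return self.head(x)                              # (1, seq, 256)


if __name__ == "__main__":
    bytes_in = torch.tensor([list(b"hello world, space byte")])  # (1, seq)
    logits = SpaceByteSketch()(bytes_in)
    print(logits.shape)                                  # (1, seq, 256)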