Autoregressive transformers are spectacular models for short sequences but scale poorly to long sequences such as high-resolution images, podcasts, code, or books. We propose Megabyte, a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes. Megabyte segments sequences into patches and uses a local submodel within patches and a global model between patches. This enables sub-quadratic self-attention, much larger feedforward layers for the same compute, and improved parallelism during decoding, unlocking better performance at reduced cost for both training and generation. Extensive experiments show that Megabyte allows byte-level models to perform competitively with subword models on long-context language modeling, achieve state-of-the-art density estimation on ImageNet, and model audio from raw files. Together, these results establish the viability of tokenization-free autoregressive sequence modeling at scale.
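To make the sub-quadratic claim concrete, the following is a minimal sketch that splits a byte sequence into fixed-size patches and counts pairwise attention scores for full attention versus a global-plus-local scheme. The function names (`split_into_patches`, `attention_cost`), the zero-byte padding, and the cost model (no constants, heads, or causal masking) are illustrative assumptions, not the authors' implementation.

```python
def split_into_patches(data: bytes, patch_size: int, pad_byte: int = 0) -> list[bytes]:
    """Split a byte sequence into fixed-size patches.

    Padding the final patch with `pad_byte` is an assumption made for
    illustration; the abstract does not say how ragged tails are handled.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    remainder = len(data) % patch_size
    if remainder:
        data = data + bytes([pad_byte]) * (patch_size - remainder)
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]


def attention_cost(seq_len: int, patch_size: int) -> dict[str, int]:
    """Count pairwise attention scores under a simplified cost model."""
    if seq_len % patch_size != 0:
        raise ValueError("seq_len must be a multiple of patch_size")
    num_patches = seq_len // patch_size
    full = seq_len ** 2                          # every byte attends to every byte
    global_cost = num_patches ** 2               # one position per patch, between patches
    local_cost = num_patches * patch_size ** 2   # bytes attend only within their patch
    return {"full": full, "global": global_cost, "local": local_cost,
            "patched": global_cost + local_cost}


patches = split_into_patches(b"hello world", patch_size=4)
print(patches)  # [b'hell', b'o wo', b'rld\x00']

cost = attention_cost(seq_len=1_000_000, patch_size=100)
print(cost["full"] // cost["patched"])  # 5000
```

Under this simplified model, a one-million-byte sequence with 100-byte patches needs about 2 × 10^8 attention scores instead of 10^12. The local term grows with the patch size while the global term shrinks, so the patch size trades one against the other.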