Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods can reduce the estimated memory-bandwidth cost by more than 50% relative to BLT on generation tasks. Each approach offers distinct advantages, and together they remove key barriers to the practical use of byte-level LMs.
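The draft-then-verify idea behind BLT-S and BLT-DV can be illustrated with a minimal sketch of generic greedy speculative decoding over bytes. This is not the paper's implementation: `draft_next`, `verify_next`, `draft_and_verify`, `generate`, and the toy models are all hypothetical names introduced here, and the Python loop over verifier calls stands in for what would be a single batched forward pass in a real system.

```python
from typing import Callable, Tuple

# Hypothetical interface: a greedy model maps a byte prefix to the next byte value.
NextByte = Callable[[bytes], int]


def draft_and_verify(prefix: bytes, draft_next: NextByte,
                     verify_next: NextByte, k: int) -> bytes:
    """Run one draft-then-verify step and return the bytes accepted in it."""
    # 1) Draft k bytes autoregressively with the cheap model.
    drafted = bytearray()
    for _ in range(k):
        drafted.append(draft_next(prefix + bytes(drafted)))

    # 2) Verify. A real system would score all k positions in one forward
    #    pass; this loop stands in for that single pass (assumption).
    accepted = bytearray()
    for i in range(k):
        target = verify_next(prefix + bytes(accepted))
        accepted.append(target)  # always keep the verifier's byte
        if target != drafted[i]:
            break  # first mismatch: discard the rest of the draft
    return bytes(accepted)


def generate(prompt: bytes, draft_next: NextByte, verify_next: NextByte,
             n_bytes: int, k: int = 4) -> Tuple[bytes, int]:
    """Generate n_bytes after the prompt; also return the number of verify steps."""
    out = prompt
    steps = 0
    while len(out) - len(prompt) < n_bytes:
        out += draft_and_verify(out, draft_next, verify_next, k)
        steps += 1
    return out[:len(prompt) + n_bytes], steps


# Toy stand-ins for the two models (purely illustrative).
TEXT = b"hello world "


def toy_verify_next(prefix: bytes) -> int:
    # "Full model": always continues the repeating pattern correctly.
    return TEXT[len(prefix) % len(TEXT)]


def toy_draft_next(prefix: bytes) -> int:
    # "Draft model": same pattern, but wrong at every fifth position.
    if len(prefix) % 5 == 4:
        return ord("?")
    return TEXT[len(prefix) % len(TEXT)]


text, steps = generate(b"", toy_draft_next, toy_verify_next, n_bytes=24, k=4)
```

In this sketch the output always equals what the verifier alone would produce greedily, because every emitted byte is the verifier's choice. The saving comes from needing fewer verification steps than generated bytes whenever the draft is often right.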