Bytes form the basis of the digital world and are thus a promising building block for multimodal foundation models. Recently, Byte Language Models (BLMs) have emerged to overcome tokenization, yet the excessive length of bytestreams requires new architectural paradigms. Therefore, we present the Multiscale Byte Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows training with context windows of $5$M bytes on a single GPU in full model precision. We thoroughly examine MBLM's performance with Transformer and Mamba blocks on both unimodal and multimodal tasks. Our experiments demonstrate that hybrid architectures are efficient in handling extremely long byte sequences during training while achieving near-linear generational efficiency. To the best of our knowledge, we present the first evaluation of BLMs on visual Q\&A tasks and find that, despite serializing images and the absence of an encoder, an MBLM with pure next-token prediction can match custom CNN-LSTM architectures with designated classification heads. We show that MBLMs exhibit strong adaptability in integrating diverse data representations, including pixel and image filestream bytes, underlining their potential toward omnimodal foundation models. Source code is publicly available at: https://github.com/ai4sd/multiscale-byte-lm