The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space of native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments that isolate the factors governing multimodal pretraining, free of interference from prior language-only pretraining. We adopt the Transfusion framework, pairing next-token prediction for language with diffusion for vision, and train on diverse data spanning text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) the Representation Autoencoder (RAE) provides an optimal unified visual representation, excelling at both visual understanding and generation; (ii) visual and language data are complementary, yielding synergistic gains in downstream capabilities; (iii) unified multimodal pretraining naturally leads to world modeling, with these capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we derive scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture reconciles this asymmetry, supplying the high model capacity that language requires while accommodating the data-intensive nature of vision, paving the way toward truly unified multimodal models.
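To make the Transfusion-style objective concrete, the following is a minimal sketch of the combined loss: shifted next-token cross-entropy on text tokens plus a noise-prediction diffusion loss on visual latents. The `backbone` interface (`lm_logits`, `denoise`), the toy variance-preserving noise schedule, and the `lambda_img` weighting are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a Transfusion-style joint objective:
# next-token prediction for text plus a diffusion (noise-prediction)
# loss on visual latents. The `backbone` interface and the toy noise
# schedule are assumptions for illustration, not the paper's code.
import torch
import torch.nn.functional as F


def joint_loss(backbone, text_ids, image_latents, lambda_img=1.0):
    """text_ids: (B, T) token ids; image_latents: (B, N, D) clean latents
    (e.g., produced by an RAE encoder)."""
    # Language: predict token t+1 from tokens <= t (shifted cross-entropy).
    logits = backbone.lm_logits(text_ids[:, :-1])              # (B, T-1, V); hypothetical method
    lm_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        text_ids[:, 1:].reshape(-1),
    )

    # Vision: corrupt latents with Gaussian noise at a random level,
    # then regress the injected noise (standard denoising objective).
    t = torch.rand(image_latents.size(0), 1, 1,
                   device=image_latents.device)                # per-sample noise level
    noise = torch.randn_like(image_latents)
    noisy = (1 - t).sqrt() * image_latents + t.sqrt() * noise  # toy variance-preserving mix
    pred_noise = backbone.denoise(noisy, t)                    # hypothetical method
    diff_loss = F.mse_loss(pred_noise, noise)

    return lm_loss + lambda_img * diff_loss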
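The IsoFLOP analysis reduces to a simple fitting procedure. The sketch below shows one common way to carry it out, with placeholder inputs and a quadratic-vertex smoothing step that is a standard choice rather than necessarily the paper's exact procedure: for each fixed compute budget, fit the loss sweep as a quadratic in log model size, take the vertex as the compute-optimal size, then fit a power law N_opt = k * C^a per modality.

```python
# Hedged sketch of an IsoFLOP fit (run once per modality). Inputs are
# placeholders; the quadratic-vertex smoothing is a common convention,
# not confirmed to be the paper's exact procedure.
import numpy as np


def isoflop_optimum(model_sizes, losses):
    """For one compute budget, fit loss as a quadratic in log(model size)
    and return the size at the parabola's vertex (the loss minimum)."""
    log_n = np.log(np.asarray(model_sizes, dtype=float))
    a2, a1, _ = np.polyfit(log_n, np.asarray(losses, dtype=float), deg=2)
    return float(np.exp(-a1 / (2.0 * a2)))


def fit_power_law(budgets, optimal_sizes):
    """Fit N_opt = k * C**a by linear regression in log-log space;
    returns (k, a)."""
    log_c = np.log(np.asarray(budgets, dtype=float))
    log_n = np.log(np.asarray(optimal_sizes, dtype=float))
    a, log_k = np.polyfit(log_c, log_n, deg=1)
    return float(np.exp(log_k)), float(a)
```

Under the usual approximation C ≈ 6ND (compute as a product of parameters N and tokens D), a smaller model-size exponent a means the compute-optimal token count grows faster with budget, which is how a "more data-hungry" modality such as vision would manifest in this kind of fit.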