Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original transformer blocks of both base models while interleaving additional multimodal self-attention blocks throughout the network. This double fusion mechanism (1) effectively enables rich multimodal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. Training on only ~35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, and 6.06 on GEdit-Bench and 3.77 on ImgEdit-Bench for image editing. By releasing the full suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.
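To make the interleaved-fusion design concrete, below is a minimal PyTorch sketch of the double fusion mechanism described above. It is an illustration under simplifying assumptions, not the released implementation: the class names MultimodalSelfAttentionBlock and DoubleFusionStack, the shared hidden dimension across both branches, and the every-other-layer fusion stride are all hypothetical choices made for this sketch.

```python
# Minimal sketch (not the authors' code) of the double fusion idea:
# keep the pretrained blocks of the understanding and generation models
# intact, and interleave new multimodal self-attention blocks that attend
# over the concatenated token streams of both branches.
import torch
import torch.nn as nn


class MultimodalSelfAttentionBlock(nn.Module):
    """Joint self-attention over concatenated understanding/generation tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, und_tokens: torch.Tensor, gen_tokens: torch.Tensor):
        x = torch.cat([und_tokens, gen_tokens], dim=1)  # (B, N_u + N_g, D)
        h = self.norm(x)
        fused, _ = self.attn(h, h, h, need_weights=False)
        x = x + fused                                   # residual fusion
        n_u = und_tokens.shape[1]
        return x[:, :n_u], x[:, n_u:]                   # split back per branch


class DoubleFusionStack(nn.Module):
    """Retained base blocks with fusion blocks interleaved every `stride` layers."""

    def __init__(self, und_blocks, gen_blocks, dim: int, stride: int = 2):
        super().__init__()
        assert len(und_blocks) == len(gen_blocks)
        self.und_blocks = nn.ModuleList(und_blocks)  # pretrained, kept as-is
        self.gen_blocks = nn.ModuleList(gen_blocks)  # pretrained, kept as-is
        self.fusion = nn.ModuleDict({
            str(i): MultimodalSelfAttentionBlock(dim)
            for i in range(len(und_blocks)) if (i + 1) % stride == 0
        })

    def forward(self, u: torch.Tensor, g: torch.Tensor):
        for i, (ub, gb) in enumerate(zip(self.und_blocks, self.gen_blocks)):
            u, g = ub(u), gb(g)            # original per-branch blocks, unchanged
            if str(i) in self.fusion:      # interleaved multimodal fusion block
                u, g = self.fusion[str(i)](u, g)
        return u, g
```

In a setup like this, the retained blocks would be loaded from the pretrained understanding and generation checkpoints, and only the interleaved fusion blocks would need training from scratch, which is consistent with the small (~35B-token) training budget the abstract reports.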