The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, where models achieve outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results. However, existing approaches often involve complex designs in model architecture or training pipeline, increasing the difficulty of model training and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. To address challenges identified in existing encoder-free unified MLLMs, we introduce a token folding mechanism and a vision-expert-based progressive alignment pretraining strategy, which together effectively support high-resolution image understanding while reducing training complexity. Trained on large-scale mixed image-text data with a unified next-token prediction objective, SynerGen-VL matches or surpasses the performance of existing encoder-free unified MLLMs at comparable or smaller parameter sizes, and narrows the gap with task-specific state-of-the-art models, highlighting a promising path toward future unified MLLMs. Our code and models will be released.
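To make the token folding idea concrete: the abstract does not spell out the mechanism, but folding schemes of this kind typically shorten the visual token sequence by merging each small spatial block of tokens into one wider token, in the spirit of a space-to-depth (pixel-unshuffle) operation. The sketch below is a minimal illustration under that assumption; the function name `fold_tokens` and the fold factor are hypothetical and not taken from the paper.

```python
import torch

def fold_tokens(x: torch.Tensor, fold: int = 2) -> torch.Tensor:
    """Illustrative token folding: merge each `fold x fold` block of
    visual tokens into a single token (assumed space-to-depth style;
    not the paper's verified implementation).

    x: (B, H, W, C) grid of visual token embeddings.
    Returns: (B, (H//fold) * (W//fold), C * fold * fold), i.e. a
    sequence 4x shorter for fold=2, with the channel dimension
    widened so no information is discarded.
    """
    B, H, W, C = x.shape
    assert H % fold == 0 and W % fold == 0, "grid must be divisible by fold"
    # Split each spatial axis into (blocks, fold) and group the
    # intra-block positions next to the channel dimension.
    x = x.view(B, H // fold, fold, W // fold, fold, C)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(B, (H // fold) * (W // fold), C * fold * fold)

# Example: a 32x32 token grid (1024 tokens) folds to 256 tokens,
# which is what makes high-resolution inputs tractable for the LLM.
tokens = torch.randn(1, 32, 32, 256)
print(fold_tokens(tokens).shape)  # torch.Size([1, 256, 1024])
```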