We introduce UGen, a unified autoregressive multimodal model that simultaneously demonstrates strong performance across text processing, image understanding, and image generation tasks. UGen converts both text and images into discrete token sequences and uses a single transformer to generate them uniformly in an autoregressive manner. To address the challenges of unified multimodal learning, UGen is trained with a novel mechanism, progressive vocabulary learning, in which visual token IDs are incrementally activated and integrated into training, ultimately improving the effectiveness of unified multimodal learning. Experiments on comprehensive text and image tasks show that UGen achieves a significant overall performance improvement of 13.3% over the vanilla unified autoregressive method, while delivering results competitive with several task-specific models across all tasks.
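To make the progressive vocabulary learning mechanism concrete, the sketch below shows one way the incremental activation of visual token IDs could be realized. The function name, the [text IDs | visual IDs] vocabulary layout, and the linear activation schedule are illustrative assumptions, since the abstract does not specify them.

```python
import torch

def progressive_vocab_mask(step: int, total_steps: int,
                           text_vocab_size: int,
                           visual_vocab_size: int) -> torch.Tensor:
    """Boolean mask over the joint vocabulary: True = token ID is active.

    A minimal sketch assuming a joint vocabulary laid out as
    [text IDs | visual IDs] and a hypothetical linear activation schedule.
    """
    # Text token IDs are active from the start of training.
    mask = torch.zeros(text_vocab_size + visual_vocab_size, dtype=torch.bool)
    mask[:text_vocab_size] = True
    # Visual token IDs are activated incrementally as training progresses.
    n_active = min(visual_vocab_size,
                   int(visual_vocab_size * step / max(total_steps, 1)))
    mask[text_vocab_size:text_vocab_size + n_active] = True
    return mask

# Usage: suppress not-yet-activated visual IDs in the transformer's output
# logits, so the loss and sampling only see the activated portion of the
# vocabulary (sizes below are placeholders, not the paper's configuration).
logits = torch.randn(4, 32, 50_000 + 8_192)          # (batch, seq, vocab)
mask = progressive_vocab_mask(step=1_000, total_steps=10_000,
                              text_vocab_size=50_000,
                              visual_vocab_size=8_192)
logits = logits.masked_fill(~mask, float("-inf"))
```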