In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy tailored for mixed-modal generation tasks. Our final instruction-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose.