Removing modeling constraints and unifying architectures across domains has been a key driver of recent progress in training large multimodal models. However, most of these models still rely on many separately trained components, such as modality-specific encoders and decoders. In this work, we further streamline joint generative modeling of images and text. We propose an autoregressive decoder-only transformer, JetFormer, which is trained to directly maximize the likelihood of raw data, without relying on any separately pretrained components, and can understand and generate both text and images. Specifically, we leverage a normalizing flow model to obtain a soft-token image representation that is jointly trained with an autoregressive multimodal transformer. At inference time, the normalizing flow serves as both an image encoder for perception tasks and an image decoder for image generation. JetFormer achieves text-to-image generation quality competitive with recent VQ-VAE- and VAE-based baselines. These baselines rely on pretrained image autoencoders, which are trained with a complex mixture of losses, including perceptual ones. At the same time, JetFormer demonstrates robust image understanding capabilities. To the best of our knowledge, JetFormer is the first model capable of generating high-fidelity images while also producing strong log-likelihood bounds.
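The key property the abstract relies on is that a normalizing flow is exactly invertible: the same module encodes an image into soft tokens (forward pass) and decodes sampled tokens back into pixels (inverse pass), while its log-determinant term turns the transformer's likelihood over tokens into an exact likelihood over raw pixels. The toy sketch below, which is our own illustration and not the paper's implementation (the single affine coupling layer, the dimensionality `D`, and the weight matrices `W_s`, `W_t` are all hypothetical), shows this encoder/decoder duality and the change-of-variables term:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8            # toy "image" dimensionality (hypothetical)
half = D // 2
W_s = rng.normal(size=(half, half)) * 0.1  # stand-in scale-network weights
W_t = rng.normal(size=(half, half)) * 0.1  # stand-in shift-network weights

def encode(x):
    """Flow forward pass: image x -> soft tokens z, plus log|det J|."""
    x_a, x_b = x[:half], x[half:]
    log_s = np.tanh(x_a @ W_s)   # bounded log-scale, predicted from x_a
    t = x_a @ W_t                # shift, predicted from x_a
    z_b = x_b * np.exp(log_s) + t
    return np.concatenate([x_a, z_b]), log_s.sum()

def decode(z):
    """Flow inverse pass: soft tokens z -> image, exact inversion."""
    z_a, z_b = z[:half], z[half:]
    log_s = np.tanh(z_a @ W_s)   # z_a == x_a, so scale/shift are recomputable
    t = z_a @ W_t
    return np.concatenate([z_a, (z_b - t) * np.exp(-log_s)])

x = rng.normal(size=D)
z, log_det = encode(x)
assert np.allclose(x, decode(z))  # one module is both encoder and decoder
# Exact pixel likelihood: log p(x) = log p(z) + log|det J|,
# where log p(z) would be modeled by the autoregressive transformer.
```

Because the inverse is exact (no reconstruction loss), training needs no separately pretrained autoencoder: maximizing the transformer's log p(z) plus the flow's log-determinant directly maximizes the likelihood of raw pixels.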