Removing modeling constraints and unifying architectures across domains has been a key driver of recent progress in training large multimodal models. However, most of these models still rely on many separately trained components, such as modality-specific encoders and decoders. In this work, we further streamline joint generative modeling of images and text. We propose an autoregressive decoder-only transformer, JetFormer, which is trained to directly maximize the likelihood of raw data, without relying on any separately pretrained components, and can understand and generate both text and images. Specifically, we leverage a normalizing flow model to obtain a soft-token image representation that is jointly trained with an autoregressive multimodal transformer. At inference time, the normalizing flow serves as both an image encoder for perception tasks and an image decoder for image generation. JetFormer achieves text-to-image generation quality competitive with recent VQ-VAE- and VAE-based baselines. These baselines rely on pretrained image autoencoders, which are trained with a complex mixture of losses, including perceptual ones. At the same time, JetFormer demonstrates robust image understanding capabilities. To the best of our knowledge, JetFormer is the first model capable of generating high-fidelity images while also producing strong log-likelihood bounds.
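The key property the abstract relies on is that a normalizing flow is exactly invertible: the same module encodes an image into soft tokens (forward pass) and decodes sampled tokens back into pixels (inverse pass), while its log-determinant term turns the transformer's likelihood over tokens into an exact likelihood over raw pixels. The toy sketch below, which is our own illustration and not the paper's implementation (the single affine coupling layer, the dimensionality `D`, and the weight matrices `W_s`, `W_t` are all hypothetical), shows this encoder/decoder duality and the change-of-variables term:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8            # toy "image" dimensionality (hypothetical)
half = D // 2
W_s = rng.normal(size=(half, half)) * 0.1  # stand-in scale-network weights
W_t = rng.normal(size=(half, half)) * 0.1  # stand-in shift-network weights

def encode(x):
    """Flow forward pass: image x -> soft tokens z, plus log|det J|."""
    x_a, x_b = x[:half], x[half:]
    log_s = np.tanh(x_a @ W_s)   # bounded log-scale, predicted from x_a
    t = x_a @ W_t                # shift, predicted from x_a
    z_b = x_b * np.exp(log_s) + t
    return np.concatenate([x_a, z_b]), log_s.sum()

def decode(z):
    """Flow inverse pass: soft tokens z -> image, exact inversion."""
    z_a, z_b = z[:half], z[half:]
    log_s = np.tanh(z_a @ W_s)   # z_a == x_a, so scale/shift are recomputable
    t = z_a @ W_t
    return np.concatenate([z_a, (z_b - t) * np.exp(-log_s)])

x = rng.normal(size=D)
z, log_det = encode(x)
assert np.allclose(x, decode(z))  # one module is both encoder and decoder
# Exact pixel likelihood: log p(x) = log p(z) + log|det J|,
# where log p(z) would be modeled by the autoregressive transformer.
```

Because the inverse is exact (no reconstruction loss), training needs no separately pretrained autoencoder: maximizing the transformer's log p(z) plus the flow's log-determinant directly maximizes the likelihood of raw pixels.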