In this work, we introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model that both perceives and generates visual and linguistic data. VL-GPT unifies pre-training for the image and text modalities under a single auto-regressive objective, so the model processes images and text as seamlessly as a language model processes text. To accomplish this, we first propose a novel image tokenizer-detokenizer framework for visual data, designed to transform raw images into a sequence of continuous embeddings and to reconstruct the images from those embeddings. Combined with an existing text tokenizer and detokenizer, this framework encodes interleaved image-text data into a multimodal sequence that can be fed directly into the transformer model. Consequently, VL-GPT can be pre-trained at large scale on multimodal corpora with a unified auto-regressive objective (i.e., next-token prediction). After pre-training, VL-GPT exhibits remarkable zero-shot and few-shot performance across a diverse range of vision and language understanding and generation tasks, including image captioning, visual question answering, and text-to-image generation. Additionally, the pre-trained model retains in-context learning capabilities when provided with multimodal prompts. We further conduct instruction tuning on VL-GPT, highlighting its exceptional potential for multimodal assistance. The source code and model weights will be released.
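The pipeline summarized above (tokenize images into continuous embeddings, interleave them with text tokens, train with next-token prediction) can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and not the VL-GPT implementation: the sizes `D`, `VOCAB`, and `N_IMG`, the toy `tokenize_image` function, the running-mean stand-in for a causal transformer, and in particular the loss split (cross-entropy on text positions, mean squared error on image-embedding positions), which the abstract does not specify.

```python
import numpy as np

D, VOCAB, N_IMG = 8, 16, 4   # hypothetical sizes, not taken from the paper
rng = np.random.default_rng(0)
text_table = rng.normal(size=(VOCAB, D))  # stand-in text embedding table
img_proj = rng.normal(size=(3, D))        # stand-in image tokenizer weights


def tokenize_image(image):
    """Toy image tokenizer: N_IMG row bands -> mean RGB -> D-dim embedding."""
    bands = np.array_split(image, N_IMG, axis=0)
    means = np.stack([band.mean(axis=(0, 1)) for band in bands])  # (N_IMG, 3)
    return means @ img_proj                                       # (N_IMG, D)


def build_sequence(segments):
    """Encode interleaved ("text", ids) / ("image", array) segments."""
    embeds, is_image, token_ids = [], [], []
    for kind, payload in segments:
        if kind == "text":
            ids = np.asarray(payload)
            embeds.append(text_table[ids])
            is_image += [False] * len(ids)
            token_ids += list(ids)
        else:
            emb = tokenize_image(payload)
            embeds.append(emb)
            is_image += [True] * len(emb)
            token_ids += [-1] * len(emb)  # image positions have no token id
    return np.concatenate(embeds), np.array(is_image), np.array(token_ids)


def next_token_loss(hidden, embeds, is_image, token_ids):
    """hidden[t] predicts position t + 1 (assumed loss split, see lead-in)."""
    pred, target_is_image = hidden[:-1], is_image[1:]
    losses = np.zeros(len(pred))
    # Image targets: regress the next continuous embedding (MSE).
    img = target_is_image
    losses[img] = ((pred[img] - embeds[1:][img]) ** 2).mean(axis=1)
    # Text targets: cross-entropy over logits tied to the embedding table.
    txt = ~target_is_image
    logits = pred[txt] @ text_table.T
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    losses[txt] = -log_probs[np.arange(txt.sum()), token_ids[1:][txt]]
    return losses.mean()


segments = [("text", [1, 2, 3]), ("image", rng.random((8, 8, 3))), ("text", [4, 5])]
embeds, is_image, token_ids = build_sequence(segments)
# Causal stand-in for the transformer: running mean of the inputs seen so far.
hidden = np.cumsum(embeds, axis=0) / np.arange(1, len(embeds) + 1)[:, None]
loss = next_token_loss(hidden, embeds, is_image, token_ids)
print(embeds.shape, is_image.astype(int), float(loss))
```

The point of the sketch is that text tokens and image embeddings share one sequence and one left-to-right prediction target, so a single decoder can be trained on both; only the per-position loss differs by modality under the assumption stated above.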