Generative modeling of multimodal image-text data has advanced rapidly with large-scale paired datasets, yet few attempts have been made to generate both images and text with a single model, as opposed to generating one fixed modality conditioned on the other. In this paper, we explore a unified generative vision-and-language (VL) model that can produce both images and text sequences. In particular, we propose MAGVLT, a generative VL transformer based on non-autoregressive mask prediction, and compare it with an autoregressive generative VL transformer (ARGVLT). Compared to ARGVLT, MAGVLT enables bidirectional context encoding, fast decoding through parallel token prediction with iterative refinement, and extended editing capabilities such as image and text infilling. To train MAGVLT rigorously from scratch on image-text pairs, we combine image-to-text, text-to-image, and joint image-and-text mask prediction tasks. Moreover, we devise two additional tasks: step-unrolled mask prediction and selective prediction on a mixture of two image-text pairs. Experimental results on various downstream generation tasks of VL benchmarks show that MAGVLT outperforms ARGVLT by a large margin while also providing a significant inference speedup. Notably, MAGVLT achieves competitive results on both zero-shot image-to-text and text-to-image generation on MS-COCO with a single moderate-sized model (fewer than 500M parameters), without using monomodal data or networks.
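To make the idea of parallel token prediction with iterative refinement concrete, the following is a minimal sketch of a generic confidence-based mask-predict decoding loop. It is not the MAGVLT implementation: the `toy_predict` function is a hypothetical stand-in for a bidirectional transformer, and the cosine re-masking schedule, the `MASK` id, and all parameter names are assumptions chosen for illustration.

```python
import math
import random

MASK = -1  # hypothetical id standing in for the [MASK] token


def toy_predict(tokens, vocab_size, rng):
    """Hypothetical stand-in for a bidirectional transformer.

    Returns one (token, confidence) guess per position. A real model
    would condition on the visible entries of `tokens`; this one is random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def mask_predict_decode(length, steps, vocab_size=16, seed=0):
    """Decode `length` tokens in `steps` parallel refinement rounds."""
    rng = random.Random(seed)
    tokens = [MASK] * length  # start from a fully masked sequence
    for t in range(steps):
        preds = toy_predict(tokens, vocab_size, rng)
        # Fill every currently masked position in parallel.
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        conf = {}
        for i in masked:
            tokens[i], conf[i] = preds[i]
        # Assumed cosine schedule: the number of positions left masked
        # shrinks each round and reaches zero after the final round.
        n_mask = int(length * math.cos(math.pi / 2 * (t + 1) / steps))
        # Re-mask the least confident of the tokens predicted this round;
        # tokens accepted in earlier rounds stay fixed.
        for i in sorted(masked, key=lambda i: conf[i])[:n_mask]:
            tokens[i] = MASK
    return tokens


if __name__ == "__main__":
    print(mask_predict_decode(length=8, steps=4))
```

With `steps` much smaller than `length`, the loop calls the model far fewer times than token-by-token autoregressive decoding would, which is the source of the speedup in this family of methods. Infilling fits the same loop by starting from a sequence in which only the region to be edited is masked.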