Current vision-language generative models rely on large corpora of paired image-text data to attain optimal performance and generalization. However, automatically collecting such data (e.g., via large-scale web scraping) yields low-quality samples with poor image-text correlation, while human annotation is more accurate but requires significant manual effort and expense. We introduce $\textbf{ITIT}$ ($\textbf{I}$n$\textbf{T}$egrating $\textbf{I}$mage $\textbf{T}$ext): an innovative training paradigm grounded in cycle consistency that allows vision-language training on unpaired image and text data. ITIT comprises a joint image-text encoder with disjoint image and text decoders, enabling bidirectional image-to-text and text-to-image generation in a single framework. During training, ITIT leverages a small set of paired image-text data to ensure that its output matches the input reasonably well in both directions. Simultaneously, the model is trained on much larger datasets containing only images or only texts. This is achieved by enforcing cycle consistency between the original unpaired samples and their cycle-generated counterparts. For instance, ITIT generates a caption for a given input image, uses that caption to create an output image, and enforces similarity between the input and output images. Our experiments show that ITIT trained with unpaired datasets exhibits scaling behavior similar to that of training with high-quality paired data. We demonstrate image generation and captioning performance on par with state-of-the-art text-to-image and image-to-text models while using orders of magnitude less paired image-text data (only 3M pairs).
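The training objective described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`mse`, `itit_style_loss`), the `cycle_weight` parameter, the use of mean squared error, and the representation of both images and texts as plain lists of floats are all assumptions made for illustration. In a real system, text is a sequence of discrete tokens and the similarity measure would differ.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Model = Callable[[Vector], Vector]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def itit_style_loss(
    image_to_text: Model,
    text_to_image: Model,
    paired: Sequence[Tuple[Vector, Vector]],
    unpaired_images: Sequence[Vector],
    unpaired_texts: Sequence[Vector],
    cycle_weight: float = 1.0,  # hypothetical weighting, not from the source
) -> float:
    """Supervised loss on paired data plus cycle-consistency loss on unpaired data."""
    # Paired data: the output should match the ground truth in both directions.
    paired_loss = sum(
        mse(image_to_text(image), text) + mse(text_to_image(text), image)
        for image, text in paired
    )
    # Unpaired images: image -> caption -> image should reproduce the input.
    image_cycle = sum(
        mse(text_to_image(image_to_text(image)), image) for image in unpaired_images
    )
    # Unpaired texts: text -> image -> text should reproduce the input.
    text_cycle = sum(
        mse(image_to_text(text_to_image(text)), text) for text in unpaired_texts
    )
    return paired_loss + cycle_weight * (image_cycle + text_cycle)


# Toy "models": doubling and halving are exact inverses, so every cycle closes.
def double(v: Vector) -> Vector:
    return [2.0 * x for x in v]


def halve(v: Vector) -> Vector:
    return [0.5 * x for x in v]


def identity(v: Vector) -> Vector:
    return list(v)


paired_data = [([1.0, 2.0], [2.0, 4.0])]
images_only = [[3.0, 1.0]]
texts_only = [[4.0, 2.0]]

consistent = itit_style_loss(double, halve, paired_data, images_only, texts_only)
inconsistent = itit_style_loss(double, identity, paired_data, images_only, texts_only)
```

Here `consistent` uses a pair of mappings that invert each other, so both the paired and cycle terms vanish, while `inconsistent` swaps in a text-to-image mapping that does not invert the captioner and is penalized by all three terms. The point of the sketch is only that the cycle terms need no paired labels: each unpaired sample serves as its own reconstruction target.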