Creating high-quality human-labeled image-caption datasets is a significant bottleneck in the development of Visual-Language Models (VLMs). We propose a novel approach that leverages the strengths of Large Language Models (LLMs) and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method pretrains a text-to-image model to synthesize image embeddings from captions generated by an LLM. These synthetic pairs are then used to train a VLM. Extensive experiments demonstrate that a VLM trained with synthetic data achieves image captioning performance comparable to that of models trained solely on human-annotated data, while requiring a fraction of the data. In particular, we outperform the baseline by 17% through augmentation with a synthetic dataset. Furthermore, we show that synthesizing in the image embedding space is 25% faster than in the pixel space. This research introduces a promising technique for generating large-scale, customizable image datasets, leading to enhanced VLM performance and wider applicability across various domains, with improved data efficiency and resource utilization.