Image captioning, an important vision-language task, often requires a very large number of finely labeled image-caption pairs to learn the underlying alignment between images and text. In this paper, we propose a multimodal data augmentation method that leverages a recent text-to-image model, Stable Diffusion, to expand the training set by generating high-quality image-caption pairs. Extensive experiments on the MS COCO dataset demonstrate the advantages of our approach over several benchmark methods, with a particularly significant boost when fewer training instances are available. In addition, models trained on our augmented datasets outperform prior unpaired image captioning methods by a large margin. Finally, training efficiency and effectiveness improve further when the generated data are deliberately filtered based on quality assessment.
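The augment-then-filter idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Pair`, `augment`, `generate`, `score`, and `threshold` are hypothetical. The abstract does not specify the quality measure, so the scorer is left as a pluggable function. In practice `generate` might wrap a Stable Diffusion text-to-image call and `score` some image-text agreement measure; here both are replaced by toy stand-ins so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class Pair:
    """One training example: an image (any representation) and its caption."""
    image: Any
    caption: str
    synthetic: bool = False


def augment(
    pairs: Iterable[Pair],
    generate: Callable[[str], Any],
    score: Callable[[Any, str], float],
    threshold: float,
) -> List[Pair]:
    """Return the original pairs plus generated pairs that pass the filter.

    generate: hypothetical text-to-image function (caption -> image).
    score: hypothetical quality function (image, caption -> float).
    threshold: assumed cutoff; generated pairs scoring below it are dropped.
    """
    originals = list(pairs)
    augmented = list(originals)  # keep every real pair
    for pair in originals:
        image = generate(pair.caption)  # synthesize an image for this caption
        if score(image, pair.caption) >= threshold:
            augmented.append(Pair(image, pair.caption, synthetic=True))
    return augmented


# Toy stand-ins for the generator and the quality scorer (assumptions only).
real = [Pair("img_a", "a dog on a beach"), Pair("img_b", "two cats")]
fake_generate = lambda caption: f"synthetic<{caption}>"
fake_score = lambda image, caption: len(caption) / 20.0

result = augment(real, fake_generate, fake_score, threshold=0.5)
```

With these stand-ins, only the generated pair for the longer caption passes the filter, so `result` holds the two real pairs plus one synthetic pair. Passing the generator and scorer as arguments keeps the sketch independent of any particular model or quality metric.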