Recent advances in image captioning are mainly driven by large-scale vision-language pretraining, relying heavily on computational resources and increasingly large multimodal datasets. Instead of scaling up pretraining data, we ask whether it is possible to improve performance by improving the quality of the samples in existing datasets. We pursue this question through two approaches to data curation: one that assumes that some examples should be avoided due to mismatches between the image and caption, and one that assumes that the mismatch can be addressed by replacing the image, for which we use the state-of-the-art Stable Diffusion model. These approaches are evaluated using the BLIP model on MS COCO and Flickr30K in both finetuning and few-shot learning settings. Our simple yet effective approaches consistently outperform baselines, indicating that better image captioning models can be trained by curating existing resources. Finally, we conduct a human study to understand the errors made by the Stable Diffusion model and highlight directions for future work in text-to-image generation.
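The two curation strategies can be illustrated with a minimal sketch. The abstract does not say how image-caption mismatches are detected, so the per-sample mismatch `scores`, the `fraction` of samples to flag, and the `generate_image` callable are all hypothetical placeholders. In the setting described above, `generate_image` would wrap a text-to-image model such as Stable Diffusion.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Sample:
    image: str    # path or identifier of the image
    caption: str  # reference caption paired with the image


def flag_mismatched(scores: Sequence[float], fraction: float) -> List[int]:
    """Return indices of the `fraction` of samples with the highest mismatch score.

    Assumption: a higher score means a worse image-caption match.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    k = int(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def curate_by_removal(
    data: Sequence[Sample], scores: Sequence[float], fraction: float
) -> List[Sample]:
    """Strategy 1: drop the flagged samples from the training set."""
    flagged = set(flag_mismatched(scores, fraction))
    return [s for i, s in enumerate(data) if i not in flagged]


def curate_by_replacement(
    data: Sequence[Sample],
    scores: Sequence[float],
    fraction: float,
    generate_image: Callable[[str], str],
) -> List[Sample]:
    """Strategy 2: keep every caption, but replace the image of each flagged sample.

    `generate_image` is a hypothetical hook that maps a caption to a new image.
    """
    flagged = set(flag_mismatched(scores, fraction))
    return [
        replace(s, image=generate_image(s.caption)) if i in flagged else s
        for i, s in enumerate(data)
    ]
```

Removal shrinks the dataset, while replacement keeps its size and every caption. That difference is one reason to evaluate the two strategies separately.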