Image captioning models are typically trained by treating all samples equally, without accounting for mismatched or otherwise difficult data points. In contrast, recent work has shown the effectiveness of training models by scheduling the data using curriculum learning strategies. This paper contributes to this direction by actively curating difficult samples in datasets without increasing the total number of samples. We explore the effect of using three data curation methods within the training process: complete removal of a sample, caption replacement, or image replacement via a text-to-image generation model. Experiments on the Flickr30K and COCO datasets with the BLIP and BEiT-3 models demonstrate that these curation methods yield improved image captioning models.
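The three curation methods named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how difficult samples are identified, so the sketch assumes a per-sample training loss compared against a fixed threshold. The names `Sample`, `curate`, `alt_captions`, and `generate_image` are all hypothetical, and `generate_image` is only a stand-in for whatever text-to-image model is used. Every flagged sample is either dropped or swapped in place, so the curated set never grows beyond the original.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Optional, Sequence

METHODS = ("remove", "replace_caption", "replace_image")


@dataclass(frozen=True)
class Sample:
    image: str    # image path or identifier
    caption: str  # caption paired with the image


def curate(
    samples: Sequence[Sample],
    losses: Sequence[float],
    threshold: float,
    method: str,
    alt_captions: Optional[Dict[str, List[str]]] = None,
    generate_image: Optional[Callable[[str], str]] = None,
) -> List[Sample]:
    """Apply one curation method to every sample whose loss exceeds threshold.

    Assumption: "difficult" means loss > threshold. The source abstract
    does not specify the selection criterion.
    """
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    if len(samples) != len(losses):
        raise ValueError("need exactly one loss per sample")
    if method == "replace_image" and generate_image is None:
        raise ValueError("replace_image requires a generate_image callable")

    curated: List[Sample] = []
    for sample, loss in zip(samples, losses):
        if loss <= threshold:
            curated.append(sample)  # not flagged: keep unchanged
        elif method == "remove":
            continue  # drop the difficult sample entirely
        elif method == "replace_caption":
            # Assumption: use another reference caption of the same image.
            options = [
                c for c in (alt_captions or {}).get(sample.image, [])
                if c != sample.caption
            ]
            # Fall back to the original sample if no alternative exists.
            curated.append(replace(sample, caption=options[0]) if options else sample)
        else:  # method == "replace_image"
            # Hypothetical text-to-image call: caption in, new image id out.
            curated.append(replace(sample, image=generate_image(sample.caption)))
    return curated
```

For example, calling `curate(data, losses, threshold=1.0, method="remove")` on a two-sample list whose second loss is above 1.0 returns only the first sample, while the two replacement methods return a list of the same length with the flagged sample's caption or image swapped.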