When trained on large-scale datasets, image captioning models can understand the content of images from a general domain but often fail to generate accurate, detailed captions. To improve performance, pretraining-and-finetuning has been a key strategy for image captioning. However, we find that large-scale bidirectional training between image and text enables zero-shot image captioning. In this paper, we introduce Bidirectional Image Text Training in largER Scale (BITTERS), an efficient training and inference framework for zero-shot image captioning. We also propose a new evaluation benchmark that comprises high-quality datasets and an extensive set of metrics to properly evaluate zero-shot captioning accuracy and societal bias. We additionally provide an efficient finetuning approach for keyword extraction. We show that careful selection of a large-scale training set and model architecture is the key to achieving zero-shot image captioning.