This report presents our solution to Topic 1, Zero-shot Image Captioning, of the 2024 NICE challenge (New frontiers for zero-shot Image Captioning Evaluation). Unlike the NICE 2023 dataset, this year's challenge uses new human annotations that differ significantly in caption style and content. We therefore improve caption quality through retrieval augmentation and caption grading. At the data level, we use high-quality captions generated by image captioning models as training data to bridge the gap in text style. At the model level, we employ OFA, a large-scale vision-language pre-trained model driven by handcrafted templates, to perform the image captioning task. We then propose a caption-level strategy for the high-quality generated captions and integrate it with the retrieval augmentation strategy in the template, prompting the model to generate higher-quality, better-matched, and semantically richer captions. Our approach achieves a CIDEr score of 234.11.
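To illustrate how retrieved captions and a caption grade might be combined into a single template, the following is a minimal sketch and not the implementation used in this report. The function names (`retrieve_captions`, `build_prompt`), the toy embeddings, the grade labels, and the exact prompt wording are all hypothetical; in practice the embeddings would come from an image-text encoder and the prompt would follow the template format expected by the captioning model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_captions(query_vec, memory, k=2):
    """Return the k captions whose embeddings are most similar to query_vec.

    memory is a list of (embedding, caption) pairs (hypothetical format).
    """
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


def build_prompt(retrieved, grade):
    """Fill a handcrafted template with retrieved captions and a quality grade.

    The template wording is an assumption made for illustration only.
    """
    context = " ; ".join(retrieved)
    return (
        f"similar captions: {context}. "
        f"quality level: {grade}. "
        "what does the image describe?"
    )


# Toy example: 2-D vectors stand in for real image/text embeddings.
memory = [
    ([1.0, 0.0], "a dog on grass"),
    ([0.0, 1.0], "a city skyline"),
    ([0.9, 0.1], "a puppy running"),
]
retrieved = retrieve_captions([1.0, 0.0], memory, k=2)
prompt = build_prompt(retrieved, grade="high")
```

At inference time the grade would be fixed to the highest level, so that the model is conditioned toward the style of the best-graded training captions while the retrieved captions supply content hints.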