Recent advances in retrieval-augmented models for image captioning highlight the benefit of retrieving related captions: it enables efficient, lightweight models with strong domain-transfer capabilities. While these models demonstrate the success of retrieval augmentation, retrieval models remain far from perfect in practice: the retrieved information can sometimes mislead the model, resulting in incorrect generation and degraded performance. In this paper, we analyze the robustness of SmallCap, a retrieval-augmented captioning model. Our analysis shows that the model is sensitive to tokens that appear in the majority of the retrieved captions, and input attribution indicates that those tokens are likely to be copied into the generated output. Given these findings, we propose training the model with retrieved captions sampled from more diverse sets. This reduces the chance that the model learns to copy majority tokens, and improves both in-domain and cross-domain performance.
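One way to realize the proposed diverse sampling is to draw the k captions used as retrieval context from a larger pool of ranked candidates, rather than always taking the top-k. The sketch below is illustrative only: the function name, parameters, and the uniform-sampling scheme are assumptions for exposition, not the paper's exact training recipe.

```python
import random


def sample_retrieved_captions(ranked_captions, k=4, pool_size=16, rng=None):
    """Sample k captions from the top-`pool_size` retrieved candidates.

    Always using the top-k captions makes majority tokens in the retrieval
    context highly predictable, encouraging the model to copy them. Sampling
    from a larger pool varies the context across training epochs, so no
    single token reliably dominates the retrieved set.

    Hypothetical sketch: parameter names and the uniform sampling strategy
    are illustrative assumptions, not the authors' exact procedure.
    """
    rng = rng or random.Random()
    pool = ranked_captions[:pool_size]  # restrict to still-relevant candidates
    return rng.sample(pool, min(k, len(pool)))  # sample without replacement
```

At inference time one would still use the top-k retrieved captions; the diversified sampling applies only during training, where it acts as a form of data augmentation on the retrieval context.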