Recent advances in retrieval-augmented models for image captioning highlight the benefit of retrieving related captions for building efficient, lightweight models with strong domain-transfer capabilities. While these models demonstrate the success of retrieval augmentation, retrieval models remain far from perfect in practice: retrieved information can mislead generation and degrade performance. In this paper, we analyze the robustness of the SmallCap retrieval-augmented captioning model. Our analysis shows that SmallCap is sensitive to tokens that appear in the majority of the retrieved captions, and integrated-gradients attribution shows that those tokens are likely to be copied into the generated caption. Given these findings, we propose training the model with retrieved captions sampled from more diverse sets. This reduces the probability that the model learns to copy majority tokens and effectively improves both in-domain and cross-domain performance.
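The abstract attributes the copying behavior using integrated gradients. As a minimal, self-contained sketch of that attribution method (on a toy differentiable function, not the actual captioning model — the function and names here are purely illustrative), integrated gradients scale the input-baseline difference by the average gradient along the straight-line path from baseline to input:

```python
import numpy as np

def integrated_gradients(f, grad_f, x, baseline=None, steps=50):
    """Approximate integrated-gradients attributions for a scalar function f at x.

    IG_i = (x_i - x'_i) * mean over interpolation points alpha of df/dx_i
    evaluated at x' + alpha * (x - x'), using the midpoint rule.
    """
    if baseline is None:
        baseline = np.zeros_like(x)
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints of [0, 1]
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Toy example: f(x) = sum(x^2), with analytic gradient 2x.
f = lambda x: np.sum(x ** 2)
grad_f = lambda x: 2 * x

x = np.array([1.0, 2.0, 3.0])
attr = integrated_gradients(f, grad_f, x)
# Completeness axiom: attributions sum to f(x) - f(baseline).
assert np.isclose(attr.sum(), f(x) - f(np.zeros_like(x)))
```

In the paper's setting, the attributed quantity would be the model's token logits with respect to the retrieved-caption tokens; high attribution on a token shared by most retrieved captions indicates it is being copied into the output.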
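The proposed training change — sampling retrieved captions from a more diverse set rather than always conditioning on the same top-k — can be sketched as follows (function names, pool size, and k are hypothetical; the paper's exact sampling scheme may differ):

```python
import random

def sample_diverse_captions(retrieved, k=4, pool_size=20, seed=None):
    """Sample k prompt captions from a larger retrieved pool.

    Always using the top-k retrieved captions exposes the model to the same
    majority tokens at every step; sampling k captions from a wider top-N
    pool varies the prompt, reducing the chance the model learns to copy
    tokens shared by most retrieved captions.
    """
    rng = random.Random(seed)
    pool = retrieved[:pool_size]  # keep only reasonably relevant candidates
    return rng.sample(pool, min(k, len(pool)))

# Stand-in retrieval results, ordered by retrieval score.
retrieved = [f"a photo of thing {i}" for i in range(30)]
prompt_captions = sample_diverse_captions(retrieved, k=4, pool_size=20, seed=0)
assert len(prompt_captions) == 4
```

At inference time the standard top-k captions can still be used; the diversification only changes what the model sees during training.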