Generating accurate and coherent image captions in a continual learning setting remains a major challenge due to catastrophic forgetting and the difficulty of aligning evolving visual concepts with language over time. In this work, we propose a novel multi-loss framework for continual image captioning that integrates semantic guidance through prompt-based continual learning and contrastive alignment. Built upon a pretrained ViT-GPT-2 backbone, our approach combines standard cross-entropy loss with three additional components: (1) a prompt-based cosine similarity loss that aligns image embeddings with synthetically constructed prompts encoding objects, attributes, and actions; (2) a CLIP-style loss that promotes alignment between image embeddings and target caption embeddings; and (3) a language-guided contrastive loss that employs a triplet formulation to enhance class-level discriminability between tasks. Notably, our approach introduces no additional overhead at inference time and requires no prompts during caption generation. We find that this approach mitigates catastrophic forgetting while achieving better semantic caption alignment than state-of-the-art methods. The code is available at https://github.com/Gepardius/Taetz_Bordelius_Continual_ImageCaptioning.
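To make the combined objective concrete, the following is a minimal PyTorch-style sketch of how the four loss terms described above could be assembled. The per-term weights, the temperature-scaled symmetric InfoNCE form of the CLIP-style term, the triplet margin, and all function and tensor names are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def prompt_cosine_loss(img_emb, prompt_emb):
    # Align image embeddings with embeddings of synthetic prompts
    # (objects, attributes, actions) via cosine similarity.
    return 1.0 - F.cosine_similarity(img_emb, prompt_emb, dim=-1).mean()

def clip_style_loss(img_emb, cap_emb, temperature=0.07):
    # Symmetric InfoNCE over the batch: matching image/caption pairs
    # sit on the diagonal of the similarity matrix.
    img = F.normalize(img_emb, dim=-1)
    cap = F.normalize(cap_emb, dim=-1)
    logits = img @ cap.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def total_loss(ce_loss, img_emb, prompt_emb, cap_emb,
               anchor, positive, negative,
               lambdas=(1.0, 1.0, 1.0), margin=0.2):
    # Language-guided contrastive term: a standard triplet loss that
    # pushes class-level embeddings of different tasks apart.
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    l1, l2, l3 = lambdas  # assumed per-term weights
    return (ce_loss
            + l1 * prompt_cosine_loss(img_emb, prompt_emb)
            + l2 * clip_style_loss(img_emb, cap_emb)
            + l3 * triplet)
```

Since all auxiliary terms act only on embeddings during training, dropping them at test time leaves the ViT-GPT-2 captioning path unchanged, which is consistent with the claim of no added inference overhead.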