Continual Learning for Image Captioning through Improved Image-Text Alignment

Generating accurate and coherent image captions in a continual learning setting remains a major challenge due to catastrophic forgetting and the difficulty of aligning evolving visual concepts with language over time. In this work, we propose a novel multi-loss framework for continual image captioning that integrates semantic guidance through prompt-based continual learning and contrastive alignment. Built upon a pretrained ViT-GPT-2 backbone, our approach combines the standard cross-entropy loss with three additional components: (1) a prompt-based cosine similarity loss that aligns image embeddings with synthetically constructed prompts encoding objects, attributes, and actions; (2) a CLIP-style loss that promotes alignment between image embeddings and target caption embeddings; and (3) a language-guided contrastive loss that employs a triplet loss to enhance class-level discriminability between tasks. Notably, our approach introduces no additional overhead at inference time and requires no prompts during caption generation. We find that this approach mitigates catastrophic forgetting while achieving better semantic caption alignment than state-of-the-art methods. The code is available at https://github.com/Gepardius/Taetz_Bordelius_Continual_ImageCaptioning.
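To make the combination of loss terms concrete, here is a minimal sketch in plain Python. It illustrates how a cross-entropy term could be summed with the three auxiliary terms named in the abstract. The abstract does not specify the exact functional forms, weights, temperature, or margin, so everything below is an assumption chosen for illustration: the `1 - cosine` form of the prompt loss, the symmetric InfoNCE form of the CLIP-style loss, the cosine-distance triplet loss, and all function and parameter names. It is not the authors' implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Cosine similarity; inputs are assumed to be non-zero vectors.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def prompt_cosine_loss(image_emb, prompt_emb):
    # Assumed form: 1 - cos(image, prompt); zero when perfectly aligned.
    return 1.0 - cosine(image_emb, prompt_emb)

def _diagonal_cross_entropy(logits):
    # Mean cross-entropy where the correct class of row k is column k.
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)

def clip_style_loss(image_embs, caption_embs, temperature=0.07):
    # Assumed form: symmetric InfoNCE over a batch of matched pairs,
    # as popularized by CLIP. The temperature value is hypothetical.
    logits = [[cosine(i, c) / temperature for c in caption_embs]
              for i in image_embs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits)
                  + _diagonal_cross_entropy(transposed))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Assumed form: hinge on cosine distances; the margin is hypothetical.
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def total_loss(ce, l_prompt, l_clip, l_triplet,
               w_prompt=1.0, w_clip=1.0, w_triplet=1.0):
    # Weighted sum; the weights are hypothetical hyperparameters.
    return ce + w_prompt * l_prompt + w_clip * l_clip + w_triplet * l_triplet
```

In this sketch the auxiliary terms act only on embeddings during training, which is consistent with the abstract's statement that no prompts or extra computation are needed when generating captions.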

