Incremental Learning (IL) for Open-ended Image-to-Text Generation (OpenITG) enables models to continuously generate accurate, contextually relevant text for new images while preserving previously acquired knowledge. Unlike prior studies, this paper addresses a more practical scenario in which the predominant category of visual data shifts over time as environments evolve. In this context, we introduce a new notion of continual alignment, which incrementally adapts the alignment module within pre-trained vision-language models (VLMs) to preserve high-quality cross-modal representations. Based on this idea, we propose Efficient Continual Alignment (ECA), a novel exemplar-free IL approach for OpenITG. The key challenge is to let the model acquire new, task-specific features while minimizing interference with the established alignment, without access to raw data from previous tasks. To address this, ECA employs three core mechanisms: a Mixture of Query (MoQ) module that adapts task-specific query tokens, a Fisher Dynamic Expansion (FeDEx) mechanism that dynamically expands the model structure based on a Fisher Information Matrix (FIM)-based metric, and an embedding dictionary with Dictionary Replay (DR) to retain past knowledge. To evaluate ECA's performance, we construct four new IL OpenITG benchmarks that better reflect real-world scenarios. Experimental results demonstrate that ECA significantly mitigates catastrophic forgetting and improves IL performance compared to baseline methods. Code and benchmarks are available at https://github.com/Snowball0823/ECA.
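To make the idea of an FIM-based expansion metric concrete, the following is a minimal sketch and not the ECA implementation. It assumes a toy logistic model, estimates the diagonal of the empirical Fisher Information Matrix as the mean squared gradient of the log-likelihood, and then applies a hypothetical expansion rule: expand when the parameters important to a new task overlap strongly with those important to earlier tasks. The function names `diagonal_fisher` and `should_expand`, the cosine-overlap criterion, and the threshold of 0.5 are all illustrative assumptions; the abstract does not specify the actual FeDEx metric.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def diagonal_fisher(weights, samples):
    """Empirical diagonal Fisher for a toy logistic model (an assumption).

    Each entry is the mean squared gradient of log p(y | x; w) with
    respect to one weight, where the gradient is (y - p) * x_j.
    """
    fisher = [0.0] * len(weights)
    for features, label in samples:
        p = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for j, x in enumerate(features):
            grad = (label - p) * x
            fisher[j] += grad * grad
    return [f / len(samples) for f in fisher]


def should_expand(fisher_old, fisher_new, threshold=0.5):
    """Hypothetical expansion rule, not taken from the paper.

    Returns True when the cosine similarity between the two importance
    vectors exceeds the threshold, i.e. when the new task relies on
    parameters that earlier tasks also rely on.
    """
    dot = sum(a * b for a, b in zip(fisher_old, fisher_new))
    norm_old = math.sqrt(sum(a * a for a in fisher_old))
    norm_new = math.sqrt(sum(b * b for b in fisher_new))
    if norm_old == 0.0 or norm_new == 0.0:
        return False  # no importance signal, so nothing to protect
    return dot / (norm_old * norm_new) > threshold


# Usage: two tasks that depend on different input features.
weights = [0.0, 0.0]
task_a = [([1.0, 0.0], 1), ([1.0, 0.0], 0)]
task_b = [([0.0, 1.0], 1), ([0.0, 1.0], 0)]
fisher_a = diagonal_fisher(weights, task_a)
fisher_b = diagonal_fisher(weights, task_b)
expand = should_expand(fisher_a, fisher_b)
```

In this toy setup the two tasks touch disjoint weights, so the overlap is zero and the rule reuses the existing structure; a new task that depended on the same weights as an old one would trigger expansion instead.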