Instruction tuning is a prevalent technique for tailoring Large Vision Language Models (LVLMs) to individual task requirements. To date, most existing approaches are confined to single-task adaptation, whereas real-world requirements are inherently varied and continually evolving. An ideal LVLM should therefore sustain continual instruction tuning over a stream of task distributions (i.e., different domains, emerging capabilities, and new datasets) while minimizing the forgetting of previously acquired knowledge. To this end, we propose a new benchmark for COntinuAl inStruction Tuning on LVLMs (COAST), which encompasses the aforementioned domain-incremental, capability-incremental, and dataset-incremental configurations. On the methodological side, we propose Continual LLaVA, a rehearsal-free method tailored for continual instruction tuning of LVLMs. To avoid the additional overhead of experience replay, we freeze the LVLM and construct dual increment embeddings for each input instruction, enabling parameter-efficient tuning. Specifically, the increment embeddings decompose into two components: 1) intrinsic increment embeddings, which encode task-specific characteristics; we maintain a low-rank pool of candidate embeddings and select the relevant ones based on their similarity to the user instruction; 2) contextual increment embeddings, which capture inter-task dependencies; the low-rank embeddings selected for previous tasks are aggregated via a learnable weighted sum to provide complementary hints. Extensive experiments indicate that the proposed Continual LLaVA outperforms previous methods, significantly reducing forgetting during the continual instruction tuning process.
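The dual increment mechanism above can be illustrated with a minimal sketch. All names, sizes, and the use of cosine similarity, top-k selection, and softmax-normalized weights are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: pool of M candidate low-rank embeddings of rank r,
# hidden dimension d, and top_k entries selected per instruction.
M, r, d, top_k = 8, 4, 16, 2

# Low-rank pool: each candidate i contributes an increment A_i @ B_i (d x d),
# plus a key vector used to match it against instruction embeddings.
pool_keys = rng.normal(size=(M, d))
pool_A = rng.normal(size=(M, d, r))
pool_B = rng.normal(size=(M, r, d))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def intrinsic_increment(instr_emb):
    """Intrinsic increment: pick the top-k pool entries whose keys are most
    similar to the instruction embedding, and sum their low-rank products."""
    sims = np.array([cosine(instr_emb, k) for k in pool_keys])
    idx = np.argsort(sims)[-top_k:]
    delta = sum(pool_A[i] @ pool_B[i] for i in idx)
    return delta, idx

def contextual_increment(prev_indices, weights):
    """Contextual increment: learnable weighted sum over the low-rank
    embeddings selected in earlier tasks (weights softmax-normalized here
    as an assumption)."""
    w = np.exp(weights - weights.max())
    w /= w.sum()
    return sum(w_i * (pool_A[i] @ pool_B[i]) for w_i, i in zip(w, prev_indices))

instr = rng.normal(size=d)                 # embedding of the user instruction
delta_intr, chosen = intrinsic_increment(instr)
prev = [0, 3, 5]                           # indices hypothetically chosen in earlier tasks
delta_ctx = contextual_increment(prev, rng.normal(size=len(prev)))
increment = delta_intr + delta_ctx         # dual increment applied to the frozen LVLM
print(increment.shape)
```

In this sketch the frozen backbone is untouched; only the pool entries and the contextual weights would be trained, which is what makes the approach parameter-efficient and rehearsal-free.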