Continual instruction tuning (CIT) during the post-training phase is crucial for adapting multimodal large language models (MLLMs) to evolving real-world demands. However, progress is hampered by the lack of benchmarks with rigorous, protocol-consistent evaluation. To bridge this gap, we introduce MLLM-CTBench, a comprehensive benchmark for CIT of MLLMs covering seven challenging tasks across six diverse domains. MLLM-CTBench makes three key contributions. First, we establish a multidimensional evaluation framework that jointly assesses final-answer accuracy and process-level reasoning quality, where Chain-of-Thought (CoT) traces serve as an observable signal for diagnosing catastrophic forgetting beyond answer-only evaluation. Second, we conduct a large-scale evaluation of continual learning methods, systematically assessing eight representative algorithms from four major families under a unified protocol across task orders and providing actionable insights for algorithm design. Third, we expand the scope of CIT from Supervised Fine-Tuning (SFT) to Reinforcement Fine-Tuning (RFT). By investigating GRPO, an on-policy RL algorithm that stabilizes updates through explicit KL-divergence control against a reference policy, we analyze how this mechanism affects cross-task knowledge retention. Our experiments yield several findings: (1) Process-level reasoning quality is often more resilient to catastrophic forgetting than final-answer accuracy, and forgetting is driven primarily by degradation of domain knowledge. (2) Model capability is a critical factor in continual learning outcomes, with stronger base models exhibiting greater resistance to catastrophic forgetting. (3) On-policy RFT (GRPO), with its inherent KL control, achieves more stable cross-task retention than SFT, while removing the KL term can amplify forgetting despite potential gains on new tasks.
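To make the KL-control mechanism concrete, the GRPO objective can be sketched as follows. This is the standard formulation from the GRPO literature rather than a detail stated in this abstract; here $\hat{A}_{i,t}$ is the group-normalized advantage over $G$ sampled responses $o_i$ to a query $q$, $\beta$ is the KL coefficient, and $\pi_{\text{ref}}$ is the reference policy:

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\;\operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{i,t}\big) - \beta\,\mathbb{D}_{\text{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big]\Big)\right],
\quad
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})}.
$$

Setting $\beta = 0$ removes the anchor to $\pi_{\text{ref}}$, which corresponds to the KL-removal ablation referenced in finding (3).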