Reinforcement learning (RL) with verifiable rewards has recently catalyzed a wave of "MLLM-r1" approaches that bring RL to vision-language models. Most representative paradigms begin with a cold start, typically supervised fine-tuning (SFT), to initialize the policy before RL. However, SFT-based cold start entangles the reasoning paradigm with the task solution and output format, which may induce instruction-style overfitting, weaken out-of-distribution generalization, and ultimately hurt downstream RL. We revisit the cold start from two views, its training method and its data construction, and introduce the Generalization Factor (GF) coefficient to quantify generalization capability under different methods. Our empirical study finds that preference-based training methods (e.g., DPO) generalize better than SFT-based methods during cold start. Motivated by this, we propose SPECS, a Self-distilled, Preference-based Cold Start framework that decouples multimodal learning: (1) it generates introspective preference data pairs via self-distillation, avoiding reliance on larger teacher models or manual annotation; (2) it performs preference-based training focused on shallow, transferable surface-form criteria (format, structure, style) rather than memorizing content; and (3) it hands off deep reasoning to RL with verifiable rewards. Experimental results across multiple multimodal benchmarks show that our decoupled learning framework yields consistent gains over strong baselines, improving MEGA-Bench by 4.1% and MathVista by 12.2%. Additional experiments indicate that SPECS reduces in-distribution "stuckness," improves exploration, stabilizes training, and raises the performance ceiling. Project Page: https://kwen-chen.github.io/SPECS-VL/
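The preference-based cold start step trains on chosen/rejected response pairs with a DPO-style objective. A minimal sketch of the standard DPO loss for a single pair is below; this is an illustration of the general technique, not the paper's exact implementation, and the inputs (summed response log-probabilities under the policy and a frozen reference model) and the temperature `beta` are assumptions for the example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed token log-probability of the chosen
    (preferred) or rejected response under the policy or the frozen
    reference model. Loss = -log sigmoid(beta * (margin of log-ratios)).
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(margin): small when the policy favors the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy already prefers the chosen response, the loss is low;
# when it prefers the rejected one, the loss is high.
low = dpo_loss(-10.0, -30.0, -20.0, -20.0)   # margin > 0, small loss
high = dpo_loss(-30.0, -10.0, -20.0, -20.0)  # margin < 0, large loss
```

Because the loss depends only on log-probability *margins* relative to the reference model, it rewards surface-form preferences (format, structure, style) without forcing the policy to memorize the chosen content token-by-token, which is the intuition behind SPECS's use of preference-based training in the cold start.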