Post-training Large Vision-and-Language Models (LVLMs) typically involves Supervised Fine-Tuning (SFT) for knowledge injection or Reinforcement Learning with Verifiable Rewards (RLVR) for performance enhancement. However, SFT often yields sub-optimal performance, while RLVR remains constrained by the model's internal knowledge base. A sequential SFT $\rightarrow$ RLVR pipeline can mitigate these issues, but it introduces significant computational overhead and suffers from catastrophic forgetting. To address these limitations, we propose ViSurf (\textbf{Vi}sual \textbf{Su}pervised-and-\textbf{R}einforcement \textbf{F}ine-Tuning), a unified, single-stage paradigm that integrates the strengths of both SFT and RLVR. By analyzing their training objectives, we derive a unified framework that injects ground-truth labels directly into the RLVR rollouts, providing external supervision and internal reinforcement simultaneously. Furthermore, we introduce three novel reward control strategies that stabilize training and improve optimization. Extensive experiments across diverse benchmarks demonstrate that ViSurf consistently outperforms standalone SFT, RLVR, and the two-stage SFT $\rightarrow$ RLVR pipeline. In-depth analysis corroborates these findings and validates the derivation and design principles of ViSurf.
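To make the core idea of label injection concrete, the following is a minimal, hedged sketch (not the paper's implementation): a ground-truth completion is appended to the group of sampled rollouts before GRPO-style group normalization, and a simple reward cap stands in for one possible reward control strategy; the function name, the cap, and the reward values are illustrative assumptions only.

\begin{verbatim}
import numpy as np

def visurf_group_advantages(rollout_rewards, gt_reward=1.0, gt_reward_cap=None):
    """Group-relative advantages when a ground-truth completion is injected
    into the sampled rollouts (GRPO-style normalization).

    rollout_rewards : verifiable rewards of the model's own rollouts.
    gt_reward       : reward assigned to the injected ground-truth label
                      (hypothetical value; the actual reward control
                      strategies are not specified in this abstract).
    gt_reward_cap   : optional cap on the ground-truth reward, a stand-in
                      for one possible reward control strategy.
    """
    if gt_reward_cap is not None:
        gt_reward = min(gt_reward, gt_reward_cap)
    # Append the ground-truth completion's reward to the rollout group.
    rewards = np.asarray(list(rollout_rewards) + [gt_reward], dtype=float)
    # Group normalization: advantage = (r - mean) / (std + eps).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # The last entry is the advantage of the injected ground-truth label
    # (external supervision); the rest reinforce the model's own rollouts.
    return adv[:-1], adv[-1]

# Example: when the rollouts mostly fail, the injected label receives a
# large positive advantage and dominates the group signal.
rollout_adv, gt_adv = visurf_group_advantages([0.0, 0.0, 1.0, 0.0])
print(rollout_adv, gt_adv)
\end{verbatim}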