Post-training Large Vision-and-Language Models (LVLMs) typically involves either Supervised Fine-Tuning (SFT) for knowledge injection or Reinforcement Learning with Verifiable Rewards (RLVR) for performance enhancement. However, SFT often yields sub-optimal performance, whereas RLVR remains constrained by the model's internal knowledge base. A sequential SFT $\rightarrow$ RLVR pipeline can combine the two, but it introduces significant computational overhead and suffers from catastrophic forgetting. To address these limitations, we propose ViSurf (\textbf{Vi}sual \textbf{Su}pervised-and-\textbf{R}einforcement \textbf{F}ine-Tuning), a unified, single-stage paradigm that integrates the strengths of both SFT and RLVR. By analyzing their training objectives, we establish a unified framework that injects ground-truth labels directly into RLVR rollouts, providing external supervision and internal reinforcement simultaneously. Furthermore, we introduce three novel reward control strategies to stabilize and optimize training. Extensive experiments demonstrate that ViSurf consistently outperforms standalone SFT, standalone RLVR, and the traditional two-stage pipeline across diverse benchmarks. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.