Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency through rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding conflicting learning signals and enabling both behaviors to be learned effectively. Building on this, we present Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves state-of-the-art performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 points while requiring only $0.72\times$ the training time per step.
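As a rough illustration of the recombination idea described above (our reading of the abstract, not the authors' implementation), the sketch below synthesizes a self-correction example by splicing a failed rollout, a correction trigger, and a successful rollout for the same prompt, and builds a per-token loss mask that supervises only the correction. The `Rollout` type, `CORRECTION_TRIGGER` template, and masking convention are all hypothetical placeholders.

```python
# Illustrative sketch (assumptions, not the paper's code): recombining two
# existing rollouts for the same prompt into one dense self-correction
# example, with response masking over the injected failed attempt.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Rollout:
    prompt: str
    response_tokens: List[str]
    is_correct: bool

# Hypothetical trigger marking the switch from the failed attempt to the
# self-correction; the actual template would be defined by the method.
CORRECTION_TRIGGER = ["Wait,", "let", "me", "re-check", "this."]

def synthesize_correction_rollout(
    failed: Rollout, solved: Rollout
) -> Tuple[List[str], List[int]]:
    """Splice a failed and a successful rollout (same prompt) into a single
    self-correction trace plus a per-token loss mask."""
    assert failed.prompt == solved.prompt
    assert not failed.is_correct and solved.is_correct
    tokens = failed.response_tokens + CORRECTION_TRIGGER + solved.response_tokens
    # Response masking: exclude the injected failed attempt (mask = 0) so the
    # self-correction signal does not conflict with direct-reasoning updates.
    mask = (
        [0] * len(failed.response_tokens)
        + [1] * len(CORRECTION_TRIGGER)
        + [1] * len(solved.response_tokens)
    )
    return tokens, mask

if __name__ == "__main__":
    failed = Rollout("Q: 12*13?", "The answer is 146 .".split(), False)
    solved = Rollout("Q: 12*13?", "12*13 = 156 .".split(), True)
    tokens, mask = synthesize_correction_rollout(failed, solved)
    print(list(zip(tokens, mask)))
```

Under these assumptions, every failed/solved rollout pair yields an additional training example at no extra generation cost, which is consistent with the abstract's claims of denser self-correction supervision and reduced per-step training time.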