Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that enables self-improvement using only unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V-Zero achieves consistent performance gains on Qwen2.5-VL-7B-Instruct, improving visual mathematical reasoning by +1.7 points and general vision-centric performance by +2.6 points, demonstrating the potential of self-improvement in multimodal systems. Code is available at https://github.com/SatonoDia/V-Zero
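The Solver's self-labeling step can be illustrated with a minimal sketch: sample several responses to a synthesized question and take the most frequent answer as the pseudo-label. The function name and the confidence heuristic below are hypothetical illustrations, not the authors' implementation.

```python
from collections import Counter

def majority_vote_pseudo_label(responses):
    """Return the most frequent answer among sampled responses
    as the pseudo-label, plus its vote fraction as a rough
    confidence score (hypothetical helper, for illustration only)."""
    counts = Counter(responses)
    label, votes = counts.most_common(1)[0]
    confidence = votes / len(responses)
    return label, confidence

# e.g. eight sampled Solver answers to one synthesized question
samples = ["42", "42", "17", "42", "42", "38", "42", "17"]
label, confidence = majority_vote_pseudo_label(samples)
# label is "42", agreed on by 5 of 8 samples
```

In a GRPO-style loop, such pseudo-labels would serve as the reward signal for the Solver's sampled responses, with no human annotation involved.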