What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show that such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset drawn from 59 source datasets, and designing task-routed rewards that handle heterogeneous answer formats. Vero achieves state-of-the-art performance, improving over four base models by 3.7–5.5 points on average across VeroEval, our suite of 30 challenging benchmarks. Starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without additional proprietary thinking data. When training from the same base model, Vero-600K yields stronger results than existing RL datasets across task categories. Systematic ablations reveal that different task categories elicit qualitatively distinct reasoning patterns that transfer poorly in isolation, suggesting that broad data coverage is the primary driver of strong RL scaling. All data, code, and models are released.
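To make the task-routed reward idea concrete, the sketch below shows one common way a reward router can dispatch heterogeneous answer formats to per-category verifiers (option letters for multiple choice, tolerance-based matching for numeric chart answers, normalized string match for short open-ended answers). This is not the released Vero implementation; all names here (`route_reward`, `TASK_VERIFIERS`, the category keys, and the tolerance value) are illustrative assumptions.

```python
# Minimal sketch of a task-routed reward, assuming per-category verifiers.
# Hypothetical names throughout; not the Vero codebase.
import re

def verify_choice(pred: str, gold: str) -> float:
    """Multiple-choice: compare the extracted option letter."""
    m = re.search(r"\b([A-E])\b", pred.strip().upper())
    return 1.0 if m and m.group(1) == gold.strip().upper() else 0.0

def verify_numeric(pred: str, gold: str, rel_tol: float = 1e-2) -> float:
    """Numeric answers (e.g., chart values): accept within a relative tolerance."""
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return 0.0
    return 1.0 if abs(p - g) <= rel_tol * max(1.0, abs(g)) else 0.0

def verify_exact(pred: str, gold: str) -> float:
    """Short open-ended answers: whitespace/case-normalized exact match."""
    norm = lambda s: " ".join(s.lower().split())
    return 1.0 if norm(pred) == norm(gold) else 0.0

# Hypothetical routing table: each task category registers its own verifier.
TASK_VERIFIERS = {
    "chart_qa": verify_numeric,
    "science_mcq": verify_choice,
    "spatial": verify_exact,
    "open_ended": verify_exact,
}

def route_reward(task: str, pred: str, gold: str) -> float:
    """Dispatch each sample to the verifier for its task category."""
    return TASK_VERIFIERS[task](pred, gold)

if __name__ == "__main__":
    print(route_reward("science_mcq", "The answer is B", "B"))  # 1.0
    print(route_reward("chart_qa", "42.3", "42.5"))             # 1.0 (within 1%)
```

The point of the routing table is that a single RL loop can train over all six task categories at once while each sample is still scored in the format its source dataset uses.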