Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision-Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards rely either on textual rules or on coarse visual-embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose the Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback by evaluating vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 points on chart-to-code, yields consistent gains on table and SVG parsing (+2.7 and +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where our 8B Visual-ERM decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.