SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More

Charts are high-density visual carriers of complex data and a medium for information extraction and analysis. Because it requires precise and complex visual reasoning, automated chart understanding poses a significant challenge to existing Multimodal Large Language Models (MLLMs). Many MLLMs trained with reinforcement learning (RL) face the challenge of credit assignment: their advantage estimation, typically performed at the trajectory level, cannot distinguish between correct and incorrect reasoning steps within a single generated response. To address this limitation, we introduce SketchVL, a novel MLLM that is optimized with FinePO, a new RL algorithm designed for fine-grained credit assignment within each trajectory. SketchVL draws its intermediate reasoning steps as markers on the image and feeds the annotated image back to itself, creating a robust, multi-step reasoning process. During training, the FinePO algorithm leverages a Fine-grained Process Reward Model (FinePRM) to score each drawing action within a trajectory, thereby assigning credit to each step individually. This mechanism allows FinePO to reward correct tokens more strongly when a trajectory is globally successful and to penalize incorrect tokens more heavily when the trajectory is globally suboptimal, thus yielding fine-grained reinforcement signals. Experiments show that SketchVL learns to align its step-level behavior with the FinePRM, achieving an average performance gain of 7.23% over its base model across chart datasets, natural image datasets, and mathematics, which suggests a promising new direction for training powerful reasoning models.
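
The abstract does not state FinePO's update rule, so the following is only a minimal sketch of the general idea of step-level credit assignment, not the authors' formulation. It assumes a group-normalized trajectory-level advantage (a GRPO-style baseline, which is an assumption), per-step process-reward scores in [0, 1], and a hypothetical mixing coefficient `alpha`; the function names `group_advantages` and `step_advantages` are illustrative only.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Trajectory-level advantages: each reward normalized within its sampled group.

    Assumption: a GRPO-style baseline; the abstract does not specify the estimator.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def step_advantages(traj_adv, step_scores, alpha=0.5):
    """Redistribute one trajectory-level advantage over the trajectory's steps.

    step_scores: hypothetical process-reward scores in [0, 1], one per drawing action.
    alpha: hypothetical strength of the step-level modulation.
    """
    out = []
    for s in step_scores:
        if traj_adv >= 0:
            # Successful trajectory: steps judged correct receive a larger reward.
            weight = 1.0 + alpha * s
        else:
            # Unsuccessful trajectory: steps judged incorrect receive a larger penalty.
            weight = 1.0 + alpha * (1.0 - s)
        out.append(traj_adv * weight)
    return out


if __name__ == "__main__":
    # Two sampled trajectories: the first succeeds, the second fails.
    advs = group_advantages([1.0, 0.0])
    # Each trajectory has one step scored as correct (1.0) and one as incorrect (0.0).
    for adv in advs:
        print(step_advantages(adv, [1.0, 0.0]))
```

In this sketch, every token generated as part of a given step would share that step's advantage. Under a trajectory-level estimator, both steps of a trajectory would instead receive the same value regardless of their individual scores.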
