Multimodal Process Reward Models (MPRMs) are central to step-level supervision for visual reasoning in MLLMs. Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies the data efficiency of MPRM training. Our preliminary experiments reveal that MPRM training quickly saturates under random subsampling of the training data, indicating substantial redundancy within existing MC-annotated corpora. To explain this, we formalize a theoretical framework and show that informative gradient updates depend on two factors: the label mixture of positive and negative steps, and label reliability (the average MC score of positive steps). Guided by these insights, we propose the Balanced-Information Score (BIS), which prioritizes both mixture and reliability using existing MC signals at the rollout level, without incurring any additional cost. Across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench, BIS-selected subsets consistently match or even surpass full-data performance at small data fractions. Notably, the BIS subset reaches full-data performance using only 10% of the training data, a 4.1% relative improvement over random subsampling.
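The abstract identifies two rollout-level factors behind informative updates: the mixture of positive/negative step labels and the reliability of positive labels. The exact BIS formula is not given here, so the sketch below is only a hypothetical illustration of how such a score might combine the two factors; the positive/negative threshold and the product form are assumptions, not the paper's definition.

```python
# Hypothetical sketch of a rollout-level selection score combining
# (a) label mixture and (b) label reliability, the two factors named
# in the abstract. Not the paper's actual BIS formula.

def balanced_information_score(mc_scores, threshold=0.5):
    """mc_scores: per-step Monte Carlo correctness estimates in [0, 1]
    for one rollout. Steps with score >= threshold count as positive
    (the threshold value is an assumption for illustration)."""
    n = len(mc_scores)
    if n == 0:
        return 0.0
    positives = [s for s in mc_scores if s >= threshold]
    # Mixture term: peaks at a balanced 50/50 split of positive and
    # negative steps, and is zero when all steps share one label.
    p = len(positives) / n
    mixture = 4.0 * p * (1.0 - p)  # in [0, 1], maximal at p = 0.5
    # Reliability term: average MC score of the positive steps;
    # higher means the positive labels are more trustworthy.
    reliability = sum(positives) / len(positives) if positives else 0.0
    return mixture * reliability

# Selection would then rank rollouts by this score and keep the
# top fraction (e.g. 10%) of the corpus for MPRM training.
```

Under this sketch, a rollout mixing confident positive steps with negative steps scores highest, while all-positive or all-negative rollouts score zero, matching the intuition that one-sided rollouts contribute little gradient information.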