RoboProcessBench: Benchmarking Process-Aware Understanding in Vision-Language Robotic Manipulation

Vision-language models (VLMs) are increasingly explored as visual critics, reward generators, and failure detectors in robotic manipulation. These roles implicitly require models to judge not only final task success but also how a manipulation execution is progressing physically and temporally. However, existing evaluations do not test whether VLMs possess such fine-grained process understanding. To address this gap, we present RoboProcessBench, a benchmark for process-aware understanding in vision-language robotic manipulation. RoboProcessBench decomposes this capability into two complementary dimensions, \emph{static monitoring} and \emph{dynamic reasoning}, instantiated as 12 diagnostic question families covering phase, contact, motion, coordination, primitive-local progress, temporal order, outcome, and primitive-level transitions. Built from physically grounded execution traces, the curated benchmark corpus ProcessData contains \textasciitilde 58k question-answer pairs across 260 manipulation tasks and is split into ProcessData-SFT for post-training and ProcessData-Eval for evaluation. Extensive evaluation of various VLMs on ProcessData-Eval reveals broad limitations across the 12 diagnostic question families, suggesting that current models still lack robust process-aware understanding of manipulation executions. After post-training on ProcessData-SFT, however, \textit{Qwen2.5-VL-7B} and \textit{InternVL-3-8B} exhibit consistent gains on local state, motion, progress, and primitive-aware cues. These results demonstrate that RoboProcessBench serves as both an evaluation benchmark and a learnable supervision source for developing VLMs capable of monitoring and evaluating robotic manipulation processes. Project webpage: \href{https://processbench-2026.github.io/RoboProcessBench-Web/}{https://processbench-2026.github.io}.
