Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. In addition, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial-state and current-state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model reduces the mean absolute error of specialized reasoning baselines by 50% and delivers significant relative accuracy improvements over 72B-scale general MLLMs. PRIMO R1 also exhibits strong zero-shot generalization on difficult failure detection tasks: it establishes state-of-the-art performance on the RoboFail benchmark with 67.0% accuracy, surpassing closed-source models such as OpenAI o1 by 6.0%.
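To make the idea of an outcome-based reward for progress estimation concrete, here is a minimal sketch. It is not the reward used by PRIMO R1, which the abstract does not specify. The `<answer>` tag format, the 0 to 100 progress scale, the linear reward shape, and the names `parse_progress`, `outcome_reward`, and `mean_absolute_error` are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence

# Assumed output format: free-form reasoning followed by
# "<answer>NUMBER</answer>", with NUMBER a progress percentage in [0, 100].
ANSWER_PATTERN = re.compile(r"<answer>\s*(\d+(?:\.\d+)?)\s*%?\s*</answer>")


def parse_progress(response: str) -> Optional[float]:
    """Extract the predicted progress percentage, or None if absent or out of range."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return None
    value = float(match.group(1))
    return value if 0.0 <= value <= 100.0 else None


def outcome_reward(response: str, target: float) -> float:
    """Hypothetical outcome-based reward: only the final answer is scored.

    The reward is 1.0 for an exact match and decreases linearly with the
    absolute error; unparseable responses receive 0.0.
    """
    predicted = parse_progress(response)
    if predicted is None:
        return 0.0
    return 1.0 - abs(predicted - target) / 100.0


def mean_absolute_error(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean absolute error between predicted and ground-truth progress values."""
    if len(preds) != len(targets) or len(preds) == 0:
        raise ValueError("preds and targets must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)
```

Because the reward depends only on the parsed final answer, the reasoning text before it is never scored directly; it is reinforced only to the extent that it leads to more accurate progress estimates.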