Data scarcity remains one of the key bottlenecks to progress in robotics. However, the amount of robotics data available in the wild is growing exponentially, creating new opportunities for large-scale data utilization. Reliable temporal task-completion prediction could help automatically annotate and curate this data at scale. The recently proposed Generative Value Learning (GVL) approach leverages knowledge embedded in vision-language models (VLMs) to predict task progress from visual observations. Building on GVL, we propose OpenGVL, a comprehensive benchmark for estimating task progress across diverse, challenging manipulation tasks involving both robotic and human embodiments. We evaluate publicly available open-source foundation models, showing that open-source model families significantly underperform their closed-source counterparts, achieving only approximately $70\%$ of closed-source performance on temporal progress prediction. Furthermore, we demonstrate how OpenGVL can serve as a practical tool for automated data curation and filtering, enabling efficient quality assessment of large-scale robotics datasets. We release the benchmark along with the complete codebase at \href{https://github.com/budzianowski/opengvl}{OpenGVL}.