Data scarcity remains one of the key factors limiting progress in robotics. However, the amount of robotics data available in the wild is growing exponentially, creating new opportunities for large-scale data utilization. Reliable temporal task-completion prediction could help annotate and curate this data automatically and at scale. Generative Value Learning (GVL), a recently proposed approach, leverages the knowledge embedded in vision-language models (VLMs) to predict task progress from visual observations. Building on GVL, we propose OpenGVL, a comprehensive benchmark for estimating task progress across diverse and challenging manipulation tasks involving both robotic and human embodiments. We evaluate the capabilities of publicly available open-source foundation models and find that open-source model families significantly underperform their closed-source counterparts, achieving only approximately $70\%$ of their performance on temporal progress prediction tasks. Furthermore, we demonstrate how OpenGVL can serve as a practical tool for automated data curation and filtering, enabling efficient quality assessment of large-scale robotics datasets. We release the benchmark along with the complete codebase at \href{github.com/budzianowski/opengvl}{OpenGVL}.