Machine learning for robot manipulation promises to unlock generalization to novel tasks and environments. But how should we measure the progress of these policies toward generalization? Evaluating and quantifying generalization is the Wild West of modern robotics, with each work proposing and measuring different types of generalization in its own, often difficult-to-reproduce settings. In this work, our goal is (1) to outline the forms of generalization we believe are important for robot manipulation in a comprehensive and fine-grained manner, and (2) to provide reproducible guidelines for measuring these notions of generalization. We first propose STAR-Gen, a taxonomy of generalization for robot manipulation structured around visual, semantic, and behavioral generalization. Next, we instantiate STAR-Gen with two case studies on real-world benchmarking: one based on open-source models and the Bridge V2 dataset, and another based on the bimanual ALOHA 2 platform, which covers more dexterous, longer-horizon tasks. Our case studies reveal many interesting insights: for example, we observe that open-source vision-language-action models often struggle with semantic generalization, despite pre-training on internet-scale language datasets. We provide videos and other supplementary material at stargen-taxonomy.github.io.