Real-robot evaluation is essential for understanding whether learned manipulation policies can operate reliably outside curated demonstrations. This need is particularly pressing for Universal Manipulation Interface (UMI)-style policies, whose performance depends on the coupling between wrist-view observations, action representation, data collection, and physical deployment. Existing real-world benchmarks have made important progress, but they are not designed around this UMI data-to-deployment setting. We present UMI-Bench 1.0, a local-first real-robot benchmark for standardized evaluation of UMI-style manipulation policies. To the best of our knowledge, this is the first benchmark dedicated to real-world evaluation of UMI-based manipulation models. UMI-Bench aligns data collection, scene reset, policy execution, result logging, and task-factor analysis within a unified protocol. By making the full evaluation process reproducible and auditable, UMI-Bench provides a practical testbed for measuring how UMI-trained policies generalize to real physical manipulation.