Rapid progress in imitation learning, foundation models, and large-scale datasets has led to robot manipulation policies that generalize to a wide range of tasks and environments. However, rigorous evaluation of these policies remains a challenge. In practice, robot policies are typically evaluated on a small number of hardware trials without any statistical assurances. We present SureSim, a framework that augments large-scale simulation with relatively small-scale real-world testing to provide reliable inferences about a policy's real-world performance. Our key idea is to formalize the problem of combining real and simulation evaluations as a prediction-powered inference problem, in which a small number of paired real and simulation evaluations are used to rectify bias in large-scale simulation. We then leverage non-asymptotic mean estimation algorithms to provide confidence intervals on mean policy performance. Using physics-based simulation, we evaluate both diffusion policy and multi-task fine-tuned \(\pi_0\) on a joint distribution of objects and initial conditions, and find that our approach saves \(20-25\%\) of hardware evaluation effort to achieve comparable bounds on policy performance.
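To make the prediction-powered construction concrete, the following is a minimal sketch of how a rectified mean estimate and a non-asymptotic confidence interval might be computed. The function name, signature, and the use of Hoeffding-style bounds are illustrative assumptions; the paper's specific non-asymptotic mean estimation algorithm may differ.

```python
import numpy as np

def ppi_mean_ci(real_paired, sim_paired, sim_only, alpha=0.05):
    """Illustrative prediction-powered estimate of mean real-world success,
    with a non-asymptotic (Hoeffding-style) confidence interval.

    real_paired : real outcomes in [0, 1] on the n paired rollouts
    sim_paired  : simulator outcomes in [0, 1] on the same n rollouts
    sim_only    : simulator outcomes in [0, 1] on N >> n additional rollouts
    """
    n, N = len(real_paired), len(sim_only)

    # Point estimate: large-scale simulation mean, rectified by the
    # real-vs-sim gap measured on the small paired set.
    rectifier = np.mean(real_paired) - np.mean(sim_paired)
    theta_hat = np.mean(sim_only) + rectifier

    # Hoeffding bounds for each term: sim outcomes lie in [0, 1],
    # per-rollout rectifier terms in [-1, 1]; a union bound over the two
    # terms gives overall coverage of at least 1 - alpha.
    w_sim = np.sqrt(np.log(4 / alpha) / (2 * N))
    w_rect = np.sqrt(2 * np.log(4 / alpha) / n)
    width = w_sim + w_rect
    return theta_hat, (theta_hat - width, theta_hat + width)
```

Under this sketch, the interval width shrinks with the large simulation budget \(N\) through the first term, while the cost of correcting simulator bias scales only with the small number of paired hardware trials \(n\).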