With the rise of stochastic generative models in robot policy learning, end-to-end visuomotor policies are increasingly successful at solving complex tasks by learning from human demonstrations. Nevertheless, since real-world evaluation costs afford users only a small number of policy rollouts, accurately gauging the performance of such policies remains a challenge. This is exacerbated by distribution shifts, which cause unpredictable changes in performance during deployment. To rigorously evaluate behavior cloning policies, we present a framework that provides a tight lower bound on robot performance in an arbitrary environment, using a minimal number of experimental policy rollouts. Notably, by applying the standard stochastic ordering to robot performance distributions, we provide a worst-case bound on the entire distribution of performance (via bounds on the cumulative distribution function) for a given task. We build upon established statistical results to ensure that the bounds hold with a user-specified confidence level and tightness, and are constructed from as few policy rollouts as possible. In experiments we evaluate policies for visuomotor manipulation in both simulation and hardware. Specifically, we (i) empirically validate the guarantees of the bounds in simulated manipulation settings, (ii) quantify the degree to which a learned policy deployed on hardware generalizes to new real-world environments, and (iii) rigorously compare two policies tested in out-of-distribution settings. Our experimental data, code, and implementation of confidence bounds are open-source.
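The abstract does not specify which statistical result underlies the bounds, but one standard way to obtain a distribution-wide confidence band on performance from few rollouts is the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, which bounds the true CDF uniformly around the empirical CDF. The sketch below is illustrative only; the function name, the Beta-distributed "rollout scores," and the choice of DKW (rather than the paper's exact construction) are all assumptions.

```python
import numpy as np

def dkw_cdf_band(samples, delta=0.05):
    """Two-sided confidence band on the CDF via the DKW inequality.

    With probability >= 1 - delta, the true CDF F satisfies
    F_hat(x) - eps <= F(x) <= F_hat(x) + eps for all x,
    where eps = sqrt(ln(2/delta) / (2n)).
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    f_hat = np.arange(1, n + 1) / n            # empirical CDF at sorted points
    eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    lower = np.clip(f_hat - eps, 0.0, 1.0)     # optimistic CDF bound
    upper = np.clip(f_hat + eps, 0.0, 1.0)     # pessimistic (worst-case) CDF bound
    return x, lower, upper, eps

# Hypothetical example: 50 simulated rollout scores in [0, 1]
rng = np.random.default_rng(0)
scores = rng.beta(5, 2, size=50)
x, lo, hi, eps = dkw_cdf_band(scores, delta=0.05)
print(f"n=50, 95% band half-width eps = {eps:.3f}")
```

Under the usual stochastic ordering, the upper CDF bound plays the role of a worst-case performance distribution: a policy whose CDF lies everywhere below another's places more mass on high scores. Note that the half-width shrinks only as 1/sqrt(n), which is why tighter constructions matter when rollouts are expensive.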