We provide finite-sample performance guarantees for control policies executed on stochastic robotic systems. Given an open- or closed-loop policy and a finite set of trajectory rollouts under that policy, we bound the expected value, value-at-risk, and conditional value-at-risk of the trajectory cost, as well as the probability of failure in a sparse-reward setting. The bounds hold with user-specified probability for any policy synthesis technique and can be seen as a post-design safety certification. Generating the bounds requires only sampled simulation rollouts, with no assumptions on the distribution or complexity of the underlying stochastic system. We also adapt these bounds into a constraint satisfaction test that verifies the safety of the robot system. Furthermore, we extend our method to the case of selecting the best policy from a set of candidates, which requires a multi-hypothesis correction. We show the statistical validity of our bounds in the Ant, Half-cheetah, and Swimmer MuJoCo environments and demonstrate our constraint satisfaction test with the Ant. Finally, using the 20-degree-of-freedom MuJoCo Shadow Hand, we show the necessity of the multi-hypothesis correction.
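The abstract does not state which statistical results underlie the bounds, so the following is only a minimal sketch of one standard distribution-free construction, not necessarily the authors' method. It assumes i.i.d. rollout costs, uses an order-statistic (binomial) argument for an upper bound on value-at-risk, and uses a one-sided Clopper-Pearson bound for the failure probability. The function names, the level `tau`, the miscoverage `delta`, and the Bonferroni split `delta / m` for `m` candidate policies are all illustrative assumptions. Bounds on the expected value and conditional value-at-risk are not covered here.

```python
import math


def binom_sf(n: int, p: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p); intended for moderate n (a few hundred)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def var_upper_bound(costs, tau: float, delta: float) -> float:
    """Return c such that VaR_tau <= c with probability >= 1 - delta.

    Assumes the costs are i.i.d. rollouts. The k-th smallest cost falls below
    the tau-quantile only if at least k samples do, which has probability at
    most P(Binomial(n, tau) >= k). We pick the smallest k making that <= delta.
    """
    n = len(costs)
    ordered = sorted(costs)
    for k in range(1, n + 1):
        if binom_sf(n, tau, k) <= delta:
            return ordered[k - 1]
    return math.inf  # too few rollouts to certify this (tau, delta) pair


def failure_prob_upper_bound(failures: int, n: int, delta: float) -> float:
    """One-sided Clopper-Pearson upper bound on the failure probability.

    Finds p with P(Binomial(n, p) <= failures) = delta by bisection.
    """
    if failures >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        cdf = 1.0 - binom_sf(n, mid, failures + 1)  # P(X <= failures)
        if cdf > delta:
            lo = mid  # p is still too small to push the CDF down to delta
        else:
            hi = mid
    return hi


def bonferroni_delta(delta: float, m: int) -> float:
    """Per-policy miscoverage when m candidates are evaluated (assumed correction)."""
    return delta / m


if __name__ == "__main__":
    # Hypothetical numbers: 100 rollouts, 2 observed failures, 10 candidate policies.
    single = failure_prob_upper_bound(2, 100, 0.05)
    corrected = failure_prob_upper_bound(2, 100, bonferroni_delta(0.05, 10))
    print(f"single-policy bound: {single:.4f}, Bonferroni-corrected: {corrected:.4f}")
```

When the best of several candidate policies is chosen using the same rollouts, evaluating each policy at `bonferroni_delta(delta, m)` keeps the overall guarantee at level `1 - delta` under this assumed correction, at the price of a looser bound per policy.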