Marginalized Importance Sampling for Off-Environment Policy Evaluation

Reinforcement Learning (RL) methods are typically sample-inefficient, making it challenging to train and deploy RL policies on real-world robots. Even a robust policy trained in simulation requires a real-world deployment to assess its performance. This paper proposes a new approach to evaluate the real-world performance of agent policies prior to deploying them in the real world. Our approach incorporates a simulator along with real-world offline data to evaluate the performance of any policy using the framework of Marginalized Importance Sampling (MIS). Existing MIS methods face two challenges: (1) large density ratios that deviate from a reasonable range and (2) indirect supervision, where the ratio needs to be inferred indirectly, thus exacerbating estimation error. Our approach addresses these challenges by introducing the target policy's occupancy in the simulator as an intermediate variable and learning the density ratio as the product of two terms that can be learned separately. The first term is learned with direct supervision and the second term has a small magnitude, thus making it computationally efficient. We analyze the sample complexity as well as the error propagation of our two-step procedure. Furthermore, we empirically evaluate our approach on Sim2Sim environments such as Cartpole, Reacher, and Half-Cheetah. Our results show that our method generalizes well across a variety of Sim2Sim gaps, target policies, and offline data collection policies. We also demonstrate the performance of our algorithm on a Sim2Real task of validating the performance of a 7-DoF robotic arm using offline data along with the Gazebo simulator.
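
The two-term factorization described above can be sketched as follows. The notation is an assumption introduced here for illustration and is not taken from the abstract: $d^{\pi}_{\mathrm{real}}$ and $d^{\pi}_{\mathrm{sim}}$ denote the normalized discounted state-action occupancies of the target policy $\pi$ in the real world and in the simulator, $\mu$ denotes the distribution of the real-world offline data, $r$ is the reward, and $\gamma$ is a discount factor.

```latex
% Assumed notation (not from the source): d^pi_real, d^pi_sim, mu, r, gamma.
% The simulator occupancy d^pi_sim is inserted as an intermediate variable.
\[
  w(s,a)
  \;=\; \frac{d^{\pi}_{\mathrm{real}}(s,a)}{\mu(s,a)}
  \;=\; \underbrace{\frac{d^{\pi}_{\mathrm{sim}}(s,a)}{\mu(s,a)}}_{\text{first term}}
  \;\cdot\;
  \underbrace{\frac{d^{\pi}_{\mathrm{real}}(s,a)}{d^{\pi}_{\mathrm{sim}}(s,a)}}_{\text{second term}}
\]
% Standard MIS estimate of the real-world return from offline samples,
% assuming normalized discounted occupancies.
\[
  J(\pi) \;=\; \frac{1}{1-\gamma}\,
  \mathbb{E}_{(s,a)\sim\mu}\bigl[\, w(s,a)\, r(s,a) \,\bigr]
\]
```

Under this reading, the first term compares two distributions that can both be sampled (simulator rollouts of $\pi$ and the offline data), which is one plausible sense in which it admits direct supervision. The second term compares the same policy across the two environments, so it would be expected to stay near 1 when the simulator is reasonably accurate.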
