Marginalized Importance Sampling for Off-Environment Policy Evaluation

Reinforcement Learning (RL) methods are typically sample-inefficient, making it challenging to train and deploy RL policies on real-world robots. Even a robust policy trained in simulation requires real-world deployment to assess its performance. This paper proposes a new approach to evaluate the real-world performance of agent policies without deploying them in the real world. The proposed approach combines a simulator with real-world offline data to evaluate the performance of any policy using the framework of Marginalized Importance Sampling (MIS). Existing MIS methods face two challenges: (1) large density ratios that deviate from a reasonable range and (2) indirect supervision, where the ratio must be inferred indirectly, which exacerbates estimation error. Our approach addresses these challenges by introducing the target policy's occupancy in the simulator as an intermediate variable and learning the density ratio as the product of two terms that can be learned separately. The first term is learned with direct supervision, and the second term has a small magnitude, which makes it easier to estimate. We analyze the sample complexity as well as the error propagation of our two-step procedure. Furthermore, we empirically evaluate our approach on Sim2Sim environments such as Cartpole, Reacher, and Half-Cheetah. Our results show that our method generalizes well across a variety of Sim2Sim gaps, target policies, and offline data collection policies. We also demonstrate the performance of our algorithm on a Sim2Real task: validating the performance of a 7-DOF robotic arm using offline data along with a Gazebo-based arm simulator.
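The factorization described in the abstract can be illustrated with a small tabular sketch. It is not the paper's algorithm: the occupancies, rewards, and variable names (`d_data`, `d_sim_pi`, `d_real_pi`, `beta`, `w`) are hypothetical, and the assignment of the "first" term to the simulator-versus-data ratio and the "second" term to the real-versus-simulator ratio is an assumption the abstract does not spell out. In practice the real-world occupancy of the target policy is unknown and both ratios would have to be learned from samples; here they are computed exactly only to show that their product recovers the full density ratio while the second term stays close to 1.

```python
# Toy illustration of a two-term density-ratio factorization for MIS.
# All numbers are made up for illustration; none come from the paper.

# Occupancies over a discrete space of 4 state-action pairs.
d_data = [0.40, 0.30, 0.20, 0.10]     # offline real-world data distribution
d_sim_pi = [0.10, 0.20, 0.30, 0.40]   # target policy's occupancy in the simulator
d_real_pi = [0.12, 0.18, 0.33, 0.37]  # target policy's real-world occupancy (unknown in practice)
reward = [0.0, 1.0, 2.0, 3.0]         # reward for each state-action pair


def ratio(num, den):
    """Element-wise density ratio num / den."""
    return [n / d for n, d in zip(num, den)]


# Assumed first term: simulator occupancy over data distribution.
# Samples from both distributions are available, so it can be fit directly.
beta = ratio(d_sim_pi, d_data)

# Assumed second term: real occupancy over simulator occupancy.
# It stays near 1 when the gap between simulator and real world is small.
w = ratio(d_real_pi, d_sim_pi)

# The single ratio that a one-step MIS method would have to learn.
full = ratio(d_real_pi, d_data)


def mis_estimate(weights):
    """Expectation under the offline data of weight * reward."""
    return sum(p * wt * r for p, wt, r in zip(d_data, weights, reward))


product = [b * x for b, x in zip(beta, w)]
value_two_step = mis_estimate(product)
value_true = sum(p * r for p, r in zip(d_real_pi, reward))

print("two-step estimate:", value_two_step)
print("true normalized value:", value_true)
print("largest second-term ratio:", max(w))
print("largest full ratio:", max(full))
```

In this toy setting the product of the two terms reproduces the true normalized value exactly, and the largest second-term ratio is much smaller than the largest full ratio, which is the property the abstract attributes to the decomposition.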
