Marginalized importance sampling (MIS), which measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution, is a promising approach for off-policy evaluation. However, current state-of-the-art MIS methods rely on complex optimization tricks and succeed mostly on simple toy problems. We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy. The successor representation can be trained through deep reinforcement learning methodology and decouples the reward optimization from the dynamics of the environment, making the resulting algorithm stable and applicable to high-dimensional domains. We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.
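The relationship between the successor representation and the density ratio can be illustrated in the tabular case. The sketch below is not the deep method evaluated in this work: it assumes a small, fully known MDP, computes the successor representation exactly by matrix inversion instead of training it, and reads the normalized discounted occupancy of the target policy directly from it. The MDP, the uniform sampling distribution `d_data`, and the function names `successor_representation` and `density_ratio` are hypothetical choices made for illustration.

```python
import numpy as np


def successor_representation(P, pi, gamma):
    """Exact tabular SR over state-action pairs: Psi = (I - gamma * P_pi)^-1."""
    n_states, n_actions = pi.shape
    n = n_states * n_actions
    # P has shape (S, A, S) with P[s, a, s'] = Pr(s' | s, a).
    # P_pi[(s, a), (s', a')] = P[s, a, s'] * pi[s', a'].
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(n, n)
    return np.linalg.inv(np.eye(n) - gamma * P_pi)


def density_ratio(P, pi, p0, d_data, gamma):
    """w(s, a) = d_pi(s, a) / d_data(s, a), with d_pi read off the SR."""
    psi = successor_representation(P, pi, gamma)
    mu0 = (p0[:, None] * pi).reshape(-1)  # initial state-action distribution
    d_pi = (1.0 - gamma) * mu0 @ psi      # normalized discounted occupancy
    return d_pi / d_data.reshape(-1)      # assumes d_data > 0 everywhere


# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])  # target policy pi[s, a]
p0 = np.array([1.0, 0.0])                # initial state distribution
r = np.array([[0.0, 1.0], [2.0, -1.0]])  # reward r[s, a]
d_data = np.full((2, 2), 0.25)           # assumed sampling distribution
gamma = 0.9

w = density_ratio(P, pi, p0, d_data, gamma)
# MIS estimate of the normalized return: E_{(s,a) ~ d_data}[w(s, a) * r(s, a)].
mis_estimate = float(np.sum(d_data.reshape(-1) * w * r.reshape(-1)))
```

Because the successor representation depends only on the dynamics and the target policy, the reward enters only in the final weighted average. In high-dimensional domains the matrix inverse is unavailable, and the successor representation would instead have to be learned from samples.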