Most reinforcement learning practitioners evaluate their policies with online Monte Carlo estimators, whether for hyperparameter tuning or for comparing algorithmic design choices: the policy is executed repeatedly in the environment and the outcomes are averaged. Such massive interaction with the environment is prohibitive in many scenarios. In this paper, we propose novel methods that improve the data efficiency of online Monte Carlo estimators while maintaining their unbiasedness. We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator. We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data. We provide a theoretical analysis that characterizes how the behavior policy learning error affects the amount of variance reduction. Compared with previous works, our method achieves better empirical performance in a broader set of environments, with fewer requirements for offline data.
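To make concrete the idea that a behavior policy can reduce the variance of a Monte Carlo estimator without introducing bias, the following is a minimal sketch using textbook importance sampling on a one-step problem. It is not the closed-form behavior policy proposed in the paper. The setting (three actions, a deterministic nonzero reward per action) and all names (`pi`, `mu`, `rewards`, `is_estimate`, `exact_mean_and_variance`, `variance_minimizing_mu`) are assumptions chosen purely for illustration.

```python
import random


def is_estimate(pi, mu, rewards, n, rng):
    """Average of n importance-weighted rewards, with actions drawn from mu.

    pi is the target policy being evaluated, mu is the behavior policy used
    to collect data. With mu == pi this is the ordinary on-policy estimator.
    """
    actions = list(range(len(pi)))
    total = 0.0
    for _ in range(n):
        a = rng.choices(actions, weights=mu)[0]
        total += pi[a] / mu[a] * rewards[a]  # reweight to correct for mu
    return total / n


def exact_mean_and_variance(pi, mu, rewards):
    """Exact mean and variance of one importance-weighted sample under mu."""
    samples = [p / m * r for p, m, r in zip(pi, mu, rewards)]
    mean = sum(m * x for m, x in zip(mu, samples))
    second_moment = sum(m * x * x for m, x in zip(mu, samples))
    return mean, second_moment - mean * mean


def variance_minimizing_mu(pi, rewards):
    """Behavior policy proportional to pi(a) * |r(a)|.

    Assumes deterministic, nonzero rewards (a simplification made here so
    that every action keeps positive probability under mu).
    """
    weights = [p * abs(r) for p, r in zip(pi, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


if __name__ == "__main__":
    pi = [0.7, 0.2, 0.1]          # hypothetical target policy
    rewards = [1.0, 5.0, 20.0]    # hypothetical deterministic rewards
    mu = variance_minimizing_mu(pi, rewards)

    print("on-policy  (mean, var):", exact_mean_and_variance(pi, pi, rewards))
    print("off-policy (mean, var):", exact_mean_and_variance(pi, mu, rewards))
    print("10-sample estimate under mu:",
          is_estimate(pi, mu, rewards, 10, random.Random(0)))
```

In this toy setting both estimators have the same expectation, 3.7, which is the unbiasedness property. Sampling from `pi` gives a per-sample variance of 32.01, while sampling from `mu` gives a variance that is zero up to floating-point error, because every reweighted sample equals the true value. In realistic problems the rewards are unknown in advance, which is why such a behavior policy has to be learned, for example from offline data as the abstract describes.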