To evaluate multiple target policies without bias, the dominant approach among RL practitioners is to run and evaluate each target policy separately. However, this method is far from efficient: samples are not shared across policies, and running a target policy to evaluate itself is not even variance-optimal. In this paper, we address these two weaknesses by designing a single tailored behavior policy that reduces the variance of the estimators across all target policies simultaneously. Theoretically, we prove that, under characterized conditions, executing this behavior policy with many times fewer samples outperforms on-policy evaluation on every target policy. Empirically, we show that our estimator has substantially lower variance than previous best methods and achieves state-of-the-art performance in a broad range of environments.
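To make the sample-sharing idea concrete, the sketch below shows how one batch of trajectories from a single behavior policy yields an unbiased per-decision importance-sampling estimate for several target policies at once. This is only the standard estimator that makes sharing possible, not the paper's tailored variance-minimizing behavior policy; the MDP, the placeholder `behavior` distribution, and all names here are hypothetical illustrations.

```python
import numpy as np

# Minimal sketch: share one behavior policy's trajectories across several
# target policies via per-decision ordinary importance sampling (OIS).
# The tailored behavior policy from the paper is NOT derived here;
# `behavior` below is just a placeholder distribution.

rng = np.random.default_rng(0)
n_states, n_actions, horizon, n_episodes = 5, 3, 10, 10_000

def random_policy(seed):
    """A random tabular policy: pi[s] is a distribution over actions."""
    p = np.random.default_rng(seed).random((n_states, n_actions))
    return p / p.sum(axis=1, keepdims=True)

behavior = random_policy(1)                      # shared data-collecting policy
targets = [random_policy(s) for s in (2, 3, 4)]  # policies to evaluate

# Stand-in MDP dynamics and rewards (hypothetical, for illustration only).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

returns = np.zeros((len(targets), n_episodes))
for ep in range(n_episodes):
    s = 0
    rho = np.ones(len(targets))  # cumulative importance ratios, one per target
    g = np.zeros(len(targets))
    for _ in range(horizon):
        a = rng.choice(n_actions, p=behavior[s])
        for i, pi in enumerate(targets):
            rho[i] *= pi[s, a] / behavior[s, a]  # update likelihood ratio
            g[i] += rho[i] * R[s, a]             # per-decision IS return
        s = rng.choice(n_states, p=P[s, a])
    returns[:, ep] = g

# A single batch of trajectories gives an unbiased estimate for every target.
for i, est in enumerate(returns.mean(axis=1)):
    print(f"target {i}: estimated value {est:.3f}")
```

Under this estimator, any behavior policy with full support is unbiased for all targets; the paper's contribution is choosing the behavior policy so that the variance of these estimates is reduced for every target policy simultaneously.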