Importance sampling (IS) is a fundamental technique underlying a wide range of off-policy reinforcement learning (RL) approaches. Policy gradient (PG) methods, in particular, benefit significantly from IS, which enables the effective reuse of previously collected samples and thus increases sample efficiency. Classically, however, IS is employed in RL as a passive tool for re-weighting historical samples, whereas the statistical community uses it as an active tool: by choosing a suitable behavioral distribution, the variance of the estimate can be reduced even below that of the sample mean. In this paper, we focus on this second setting by addressing the behavioral policy optimization (BPO) problem: we look for the best behavioral policy from which to collect samples so as to reduce the policy gradient variance as much as possible. We provide an iterative algorithm that alternates between the cross-entropy estimation of the minimum-variance behavioral policy and the actual policy optimization, leveraging defensive IS. We theoretically analyze this algorithm, showing that it enjoys a convergence rate of order $O(\epsilon^{-4})$ to a stationary point, while depending on a more favorable variance term than standard PG methods. We then provide a practical version that is numerically validated, showing its advantages in terms of policy gradient estimation variance and learning speed.
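To make the role of defensive IS concrete, the following is a minimal sketch of the kind of gradient estimator the abstract refers to; the mixture coefficient $\alpha \in (0,1]$, the behavioral policy $\beta$, and the trajectory-level notation are illustrative assumptions, not definitions taken from the paper. Trajectories are drawn from the defensive mixture $\alpha\, p_{\pi_\theta} + (1-\alpha)\, p_{\beta}$, which keeps the importance weights bounded by $1/\alpha$:
\[
\widehat{\nabla}_\theta J(\theta) \;=\; \frac{1}{N} \sum_{i=1}^{N} \frac{p_{\pi_\theta}(\tau_i)}{\alpha\, p_{\pi_\theta}(\tau_i) + (1-\alpha)\, p_{\beta}(\tau_i)}\, \nabla_\theta \log p_{\pi_\theta}(\tau_i)\, R(\tau_i), \qquad \tau_i \sim \alpha\, p_{\pi_\theta} + (1-\alpha)\, p_{\beta}.
\]
In this sketch, the behavioral policy $\beta$ would then be updated by a cross-entropy step toward the minimum-variance sampling distribution, i.e., by maximizing an IS-weighted log-likelihood of the collected trajectories, in the spirit of the cross-entropy method.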