General Value Functions (GVFs) (Sutton et al., 2011) are an established way to represent predictive knowledge in reinforcement learning. Each GVF computes the expected return for a given policy, based on a unique pseudo-reward. Multiple GVFs can be estimated in parallel using off-policy learning from a single stream of data, often sourced from a fixed behavior policy or a pre-collected dataset. This leaves an open question: how can a behavior policy be chosen for data-efficient GVF learning? To address this gap, we propose GVFExplorer, which learns a behavior policy that efficiently gathers data for evaluating multiple GVFs in parallel. This behavior policy selects actions in proportion to the total variance of the return across all GVFs, reducing the number of environment interactions required. To enable accurate variance estimation, we use a recently proposed temporal-difference-style variance estimator. We prove that each behavior policy update reduces the mean squared error in the summed predictions over all GVFs. We empirically demonstrate our method's performance with both tabular representations and nonlinear function approximation.
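The action-selection rule described above can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's implementation): given estimated return variances for each GVF at the current state, the behavior policy's action probabilities are made proportional to the variance summed over GVFs. The function name and array shapes are assumptions for illustration.

```python
import numpy as np

def behavior_policy(variances: np.ndarray) -> np.ndarray:
    """Sketch of a variance-proportional behavior policy.

    variances: shape (num_gvfs, num_actions); entry [i, a] is GVF i's
    estimated return variance for action a in the current state.
    Returns action probabilities proportional to total variance.
    """
    total = variances.sum(axis=0)  # total variance per action across all GVFs
    if total.sum() == 0.0:
        # No variance signal yet: fall back to a uniform policy.
        return np.full(total.shape, 1.0 / total.size)
    return total / total.sum()

# Example: two GVFs, two actions; per-action totals are [3.0, 5.0],
# so the policy samples the second action more often.
probs = behavior_policy(np.array([[1.0, 3.0], [2.0, 2.0]]))
```

Directing exploration toward high-variance actions concentrates samples where return estimates are least certain, which is the intuition behind the reduced number of environment interactions.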