Reward evaluation of episodes is a bottleneck in a broad range of reinforcement learning tasks. In this paper, we aim to select a small but representative subset from a large batch of episodes and to compute rewards only on that subset, making policy gradient iterations more efficient. We build a Gaussian process model of discounted returns or rewards to derive a positive definite kernel on the space of episodes, run an ``episodic'' kernel quadrature method to compress the information in the sample episodes, and pass the reduced set of episodes to the policy network for gradient updates. We present the theoretical background of this procedure as well as numerical illustrations on MuJoCo tasks.
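To make the subset-selection step concrete, here is a minimal sketch under strong simplifying assumptions; it is not the method of this paper. Each episode is assumed to be summarized by a fixed-length feature vector (a hypothetical embedding), an RBF kernel stands in for the kernel derived from the Gaussian process model, and greedy kernel herding stands in for the episodic kernel quadrature method. The names `rbf_kernel`, `select_episodes`, and `quadrature_weights` are illustrative only.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """RBF kernel matrix between rows of X and Y (assumed stand-in kernel)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def select_episodes(K, m):
    """Greedy kernel herding: pick m distinct indices whose kernel mean
    tracks the kernel mean of the full batch."""
    n = K.shape[0]
    mean_embedding = K.mean(axis=1)      # (1/n) * sum_j k(x_i, x_j)
    running = np.zeros(n)                # sum over selected s of k(x_i, x_s)
    available = np.ones(n, dtype=bool)   # select without replacement
    selected = []
    for t in range(m):
        score = mean_embedding - running / (t + 1)
        score = np.where(available, score, -np.inf)
        i = int(np.argmax(score))
        selected.append(i)
        available[i] = False
        running += K[:, i]
    return np.array(selected)

def quadrature_weights(K, selected, jitter=1e-8):
    """Weights w solving K_SS w = z, where z is the batch kernel mean
    evaluated at the selected episodes."""
    K_ss = K[np.ix_(selected, selected)]
    z = K[selected, :].mean(axis=1)
    return np.linalg.solve(K_ss + jitter * np.eye(len(selected)), z)

# Hypothetical batch: 200 episodes, each summarized by a 5-dimensional feature vector.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))
K = rbf_kernel(features, features)
idx = select_episodes(K, m=20)
w = quadrature_weights(K, idx)
# Rewards would then be computed only for the episodes in idx, and a batch
# average of returns approximated by w @ returns_on_subset.
```

With weights chosen this way, the weighted sum over the subset reproduces the batch average exactly (up to the jitter term) for any function of the form k(., x_s) with x_s in the subset, which is the sense in which the subset "compresses" the batch in this sketch.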