Reward evaluation of episodes is a bottleneck in a broad range of reinforcement learning tasks. In this paper, we aim to select a small but representative subset of a large batch of episodes and to compute rewards only on that subset, making policy gradient iterations more efficient. We build a Gaussian process model of discounted returns or rewards to derive a positive definite kernel on the space of episodes, run an "episodic" kernel quadrature method to compress the information in the sample episodes, and pass the reduced set of episodes to the policy network for gradient updates. We present the theoretical background of this procedure together with numerical illustrations on MuJoCo and causal discovery tasks.
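As a rough illustration of the subset-selection step, the sketch below picks a few representative episodes from a batch before any reward is computed. It rests on several assumptions that are not taken from the paper: each episode is summarized by a fixed-length feature vector, the kernel is a plain squared-exponential kernel rather than one derived from a Gaussian process model of returns, and the selection rule is greedy kernel herding, a simple kernel quadrature heuristic substituted here for the "episodic" kernel quadrature method. The names `rbf_kernel` and `herding_select` are hypothetical.

```python
import math
import random


def rbf_kernel(x, y, lengthscale=1.0):
    """Squared-exponential kernel between two feature vectors (an assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


def herding_select(features, m, lengthscale=1.0):
    """Greedily pick m distinct episode indices by kernel herding.

    At step t the next index maximizes
        mean_j k(x_i, x_j) - (1 / (t + 1)) * sum_{s selected} k(x_i, x_s),
    so the chosen points track the kernel mean of the full batch
    while staying away from points that were already chosen.
    """
    n = len(features)
    if m > n:
        raise ValueError("cannot select more episodes than the batch contains")

    # Gram matrix over the whole batch; O(n^2) kernel evaluations, no rewards needed.
    gram = [[rbf_kernel(features[i], features[j], lengthscale) for j in range(n)]
            for i in range(n)]
    # Kernel mean embedding of the batch, evaluated at each episode.
    mean_embed = [sum(row) / n for row in gram]

    selected = []
    sum_selected = [0.0] * n  # running sum of k(x_i, x_s) over selected s
    for t in range(m):
        best, best_score = None, -math.inf
        for i in range(n):
            if i in selected:
                continue
            score = mean_embed[i] - sum_selected[i] / (t + 1)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        for i in range(n):
            sum_selected[i] += gram[i][best]
    return selected


# Toy batch: 50 hypothetical episodes, each summarized by a 2-d feature vector.
random.seed(0)
episodes = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(50)]
subset = herding_select(episodes, m=5)
```

In a policy gradient loop, the expensive reward function would then be called only on `[episodes[i] for i in subset]`. Equal weights on the selected episodes are the simplest choice; a kernel quadrature method would typically also return non-uniform weights, which this sketch omits.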