In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal, i.e., the expected return. In many tasks of interest, such as policy optimization, the agent usually spends its interaction budget collecting episodes of fixed length within a simulator (i.e., Monte Carlo simulation). However, given the discounted nature of the RL objective, this data collection strategy might not be the best option: the rewards collected in early simulation steps weigh exponentially more than later ones. Building on this intuition, we design an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths, i.e., truncated trajectories. The proposed approach provably minimizes the width of the confidence intervals around the empirical estimates of the expected return of a policy. After discussing the theoretical properties of our method, we use our trajectory truncation mechanism to extend the Policy Optimization via Importance Sampling (POIS; Metelli et al., 2018) algorithm. Finally, we conduct a numerical comparison between our algorithm and POIS: the results are consistent with our theory and show that an appropriate truncation of the trajectories can improve performance.
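As a rough illustration of the intuition above, the following sketch contrasts fixed-length data collection with a truncated schedule under the same interaction budget. It is not the allocation rule derived in the paper: `proportional_allocation` is a hypothetical heuristic that simply assigns samples to step t in proportion to the discount weight gamma^t, and all function names and parameter values are assumptions made for this example.

```python
import numpy as np


def discount_weights(gamma, horizon):
    """Weight gamma**t that the discounted return assigns to the reward at step t."""
    return gamma ** np.arange(horizon)


def uniform_allocation(budget, horizon):
    """Fixed-length episodes: every step is sampled budget // horizon times."""
    return np.full(horizon, budget // horizon)


def proportional_allocation(budget, gamma, horizon):
    """Hypothetical heuristic (an assumption, not the paper's scheme):
    samples at step t are proportional to gamma**t, rounded down.
    Steps that receive zero samples are never visited."""
    w = discount_weights(gamma, horizon)
    return np.floor(budget * w / w.sum()).astype(int)


def trajectory_lengths(samples_per_step):
    """Turn a non-increasing per-step sample count into trajectory lengths:
    trajectory i runs for as many steps as have more than i samples."""
    n = np.asarray(samples_per_step)
    return np.array([int((n > i).sum()) for i in range(int(n[0]))])


if __name__ == "__main__":
    gamma, horizon, budget = 0.9, 20, 100  # illustrative values only
    fixed = trajectory_lengths(uniform_allocation(budget, horizon))
    truncated = trajectory_lengths(proportional_allocation(budget, gamma, horizon))
    print("fixed-length schedule:", fixed)
    print("truncated schedule:   ", truncated)
```

Under this heuristic, the same budget yields more trajectories, most of them short, so early (heavily weighted) steps are sampled more often than late ones. Whether such a schedule is optimal is exactly what the paper's confidence-interval analysis addresses.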