Real-world decision-making problems often involve delayed rewards, which reduce the sample efficiency of reinforcement learning, especially in the extreme case where the only feedback is an episodic reward obtained at the end of an episode. Episodic return decomposition is a promising way to handle this episodic-reward setting. Several algorithms have shown that step-wise proxy rewards learned through return decomposition are remarkably effective. However, existing methods lack either attribution or representation capacity, which makes decomposition inefficient for long-horizon episodes. In this paper, we propose a novel episodic return decomposition method called Diaster (Difference of implicitly assigned sub-trajectory reward). Diaster decomposes any episodic reward into the credits of the two sub-trajectories obtained by dividing the trajectory at any cut point, and derives the step-wise proxy rewards from differences in expectation. We verify theoretically and empirically that the decomposed proxy reward function can guide the policy to be nearly optimal. Experimental results show that our method outperforms previous state-of-the-art methods in both sample efficiency and performance.
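The decomposition described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the names `sub_credit`, `proxy_rewards`, and `decomposition_gap` are hypothetical, the sub-trajectory credit function is a hand-written additive stand-in rather than a learned model, and the expectation is replaced by a single deterministic trajectory. Under those assumptions, the sketch shows (a) checking that the credits of the two sub-trajectories sum to the episodic reward at every cut point, and (b) reading off step-wise proxy rewards as differences between the credits of consecutive prefixes.

```python
import numpy as np


def sub_credit(segment):
    # Hypothetical stand-in for a learned sub-trajectory credit function.
    # Each step is a single float here, and the credit is simply additive.
    return 0.5 * float(sum(segment))


def decomposition_gap(trajectory, episodic_reward, cut):
    # Residual of "episodic reward = credit(first part) + credit(second part)"
    # for one cut point; a learned credit function would be trained to shrink it.
    return episodic_reward - (sub_credit(trajectory[:cut]) + sub_credit(trajectory[cut:]))


def proxy_rewards(trajectory):
    # Proxy reward at step t is credit(prefix up to t+1) minus credit(prefix up to t).
    # Assumption: one sample stands in for the expectation mentioned in the text.
    prefix_credits = [sub_credit(trajectory[:t]) for t in range(len(trajectory) + 1)]
    return np.diff(prefix_credits)


# Toy episode: four steps, with feedback only at the end of the episode.
trajectory = [1.0, 2.0, 3.0, 4.0]
episodic_reward = 5.0

gaps = [decomposition_gap(trajectory, episodic_reward, c) for c in range(len(trajectory) + 1)]
rewards = proxy_rewards(trajectory)
```

Because the prefix credits telescope, the proxy rewards in this toy sum to the credit of the full trajectory, which here equals the episodic reward. With a learned, non-additive credit function that equality would hold only approximately.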