Episodic Reinforcement Learning with Expanded State-reward Space

Empowered by deep neural networks, deep reinforcement learning (DRL) has demonstrated tremendous empirical success in various domains, including games, health care, and autonomous driving. Despite these advances, DRL is still considered data-inefficient, because learning effective policies demands vast numbers of environment samples. Recently, episodic control (EC)-based model-free DRL methods have improved sample efficiency by recalling past experiences from episodic memory. However, existing EC-based methods suffer from a potential misalignment between the state and reward spaces: they neglect the information-rich (past) states retrieved from memory, which can cause inaccurate value estimation and degraded policy performance. To tackle this issue, we introduce an efficient EC-based DRL framework with an expanded state-reward space, in which both the expanded states used as input and the expanded rewards used in training contain historical and current information. Specifically, we reuse the historical states retrieved by EC as part of the input states and integrate the retrieved Monte Carlo (MC) returns into the immediate reward of each interaction transition. As a result, our method simultaneously makes full use of the retrieved information and obtains a better evaluation of state values through a temporal-difference (TD) loss. Empirical results on challenging Box2D and MuJoCo tasks demonstrate the superiority of our method over a recent sibling method and common baselines. Additional Q-value comparison experiments further verify our method's effectiveness in alleviating Q-value overestimation.
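To make the idea of an expanded state-reward space concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy episodic memory with nearest-neighbour retrieval by Euclidean distance, state expansion by concatenation, and an additive reward bonus weighted by a hypothetical coefficient `beta`; the names `EpisodicMemory`, `expand_transition`, and `beta` are illustrative and do not come from the paper, whose abstract does not specify these details.

```python
import numpy as np


class EpisodicMemory:
    """Toy episodic memory: stores visited states with their MC returns."""

    def __init__(self):
        self.states = []   # stored states (also used as retrieval keys)
        self.returns = []  # discounted MC return observed from each state

    def add_episode(self, states, rewards, gamma=0.99):
        # Compute discounted Monte Carlo returns backwards through the episode.
        g = 0.0
        mc_returns = []
        for r in reversed(rewards):
            g = r + gamma * g
            mc_returns.append(g)
        mc_returns.reverse()
        for s, ret in zip(states, mc_returns):
            self.states.append(np.asarray(s, dtype=np.float64))
            self.returns.append(ret)

    def retrieve(self, state):
        # Assumption: nearest neighbour under Euclidean distance.
        keys = np.stack(self.states)
        idx = int(np.argmin(np.linalg.norm(keys - state, axis=1)))
        return self.states[idx], self.returns[idx]


def expand_transition(memory, state, reward, beta=0.1):
    """Build an expanded state and expanded reward for one transition.

    Assumptions: the expanded state concatenates the current state with the
    retrieved past state, and the expanded reward adds beta times the
    retrieved MC return to the immediate reward.
    """
    state = np.asarray(state, dtype=np.float64)
    past_state, past_return = memory.retrieve(state)
    expanded_state = np.concatenate([state, past_state])
    expanded_reward = reward + beta * past_return
    return expanded_state, expanded_reward
```

Under these assumptions, the expanded transitions would replace the raw ones in the replay buffer of an ordinary TD-based learner, so the value network sees both current and retrieved information in its input and in its training target.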