State-of-the-art deep Q-learning methods update Q-values using state transition tuples sampled from an experience replay buffer. The sampling is typically either uniform at random or prioritized by a measure such as the temporal difference (TD) error. Such sampling strategies can be inefficient at learning the Q-function because a state's Q-value depends on the Q-values of its successor states. If the data sampling strategy ignores the precision of the Q-value estimate of the next state, it can lead to useless and often incorrect updates to the Q-values. To mitigate this issue, we organize the agent's experience into a graph that explicitly tracks the dependency between Q-values of states. Each edge in the graph represents a transition between two states by executing a single action. We perform value backups via a breadth-first search that expands vertices in the graph, starting from the set of terminal states and successively moving backward. We empirically show that our method is substantially more data-efficient than several baselines on a diverse range of goal-reaching tasks. Notably, the proposed method also outperforms baselines that consume more batches of training experience, and it operates from high-dimensional observational data such as images.
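The backward, breadth-first ordering of value backups can be illustrated with a small tabular sketch. This is not the method's actual implementation, which targets deep Q-networks and image observations; it only assumes a deterministic experience graph stored as `(state, action, reward, next_state, done)` tuples. The function name `bfs_value_backup` and the discount `gamma` are hypothetical names chosen for illustration.

```python
from collections import defaultdict, deque


def bfs_value_backup(transitions, gamma=0.99):
    """Tabular sketch: back up Q-values breadth-first from terminal states.

    transitions: iterable of (state, action, reward, next_state, done).
    Returns a mapping q[state][action] -> value estimate.
    """
    q = defaultdict(dict)
    # Reverse adjacency: next_state -> list of (state, action, reward) edges.
    predecessors = defaultdict(list)
    terminals = set()
    for s, a, r, s_next, done in transitions:
        predecessors[s_next].append((s, a, r))
        if done:
            terminals.add(s_next)

    # Start the search from every terminal state and move backward.
    queue = deque(terminals)
    visited = set(terminals)
    while queue:
        v = queue.popleft()
        # A terminal state has no future return; otherwise bootstrap from
        # the best action value already backed up for this successor.
        v_value = 0.0 if v in terminals else max(q[v].values(), default=0.0)
        for s, a, r in predecessors[v]:
            q[s][a] = r + gamma * v_value
            if s not in visited:
                visited.add(s)
                queue.append(s)
    return q


# Hypothetical three-step chain ending in a goal state "G".
chain = [
    ("A", "right", 0.0, "B", False),
    ("B", "right", 0.0, "C", False),
    ("C", "right", 1.0, "G", True),
]
q_values = bfs_value_backup(chain, gamma=0.5)
```

Because each state is updated only after its successor has been expanded, the value of reaching the goal propagates through the whole chain in a single pass, whereas uniformly sampled updates may bootstrap from successor estimates that have not been updated yet.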