In reinforcement learning (RL), state representations are key to dealing with large or continuous state spaces. While one of the promises of deep learning algorithms is to automatically construct features well-tuned for the task they try to solve, such a representation might not emerge from end-to-end training of deep RL agents. To mitigate this issue, auxiliary objectives are often incorporated into the learning process to help shape the learned state representation. Bootstrapping methods are currently the preferred way to make these additional predictions. Yet, it is unclear which features these algorithms capture and how they relate to those from other auxiliary-task-based approaches. In this paper, we address this gap and provide a theoretical characterization of the state representation learned by temporal difference learning (Sutton, 1988). Surprisingly, we find that, in the policy evaluation setting, this representation differs from the features learned by Monte Carlo and residual gradient algorithms for most transition structures of the environment. We describe the efficacy of these representations for policy evaluation, and use our theoretical analysis to design new auxiliary learning rules. We complement our theoretical results with an empirical comparison of these learning rules for different cumulant functions on classic domains such as the four-room domain (Sutton et al., 1999) and Mountain Car (Moore, 1990).
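To make the distinction between the three families of algorithms named above concrete, the following is a minimal sketch of their textbook update rules for policy evaluation with a linear value function over fixed features. It is an illustration under stated assumptions, not the paper's method: the paper studies the representations these rules induce, whereas here the features are held fixed, and all function and parameter names (`td0_update`, `residual_gradient_update`, `monte_carlo_update`, `lr`, `phi`) are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def td0_update(w: Vector, phi: Vector, reward: float,
               phi_next: Vector, gamma: float, lr: float) -> Vector:
    """Semi-gradient TD(0): bootstrap from the next-state estimate,
    treating the bootstrapped target as a constant."""
    delta = reward + gamma * dot(w, phi_next) - dot(w, phi)
    return [wi + lr * delta * pi for wi, pi in zip(w, phi)]


def residual_gradient_update(w: Vector, phi: Vector, reward: float,
                             phi_next: Vector, gamma: float,
                             lr: float) -> Vector:
    """Residual gradient: gradient descent on the squared TD error,
    differentiating through the bootstrapped target as well."""
    delta = reward + gamma * dot(w, phi_next) - dot(w, phi)
    return [wi + lr * delta * (pi - gamma * pn)
            for wi, pi, pn in zip(w, phi, phi_next)]


def monte_carlo_update(w: Vector, phi: Vector, ret: float,
                       lr: float) -> Vector:
    """Monte Carlo: regress the value estimate toward a sampled return,
    with no bootstrapping."""
    error = ret - dot(w, phi)
    return [wi + lr * error * pi for wi, pi in zip(w, phi)]


if __name__ == "__main__":
    # Hypothetical transition: one-hot features, reward 1, discount 0.5.
    w0 = [0.0, 0.0]
    phi, phi_next = [1.0, 0.0], [0.0, 1.0]
    print(td0_update(w0, phi, 1.0, phi_next, 0.5, 0.1))
    print(residual_gradient_update(w0, phi, 1.0, phi_next, 0.5, 0.1))
    print(monte_carlo_update(w0, phi, 2.0, 0.1))
```

On the same transition, the TD(0) rule changes only the weights of the current state's features, while the residual gradient rule also moves the weights of the next state's features. This difference in update direction is the kind of distinction the abstract refers to when it contrasts the algorithms.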