Auxiliary tasks improve the representations learned by deep reinforcement learning agents. Analytically, their effect is reasonably well understood; in practice, however, they are still used mainly to support a main learning objective rather than as a method for learning representations in their own right. This is perhaps surprising given that many auxiliary tasks are defined procedurally, and hence can be treated as an essentially infinite source of information about the environment. Based on this observation, we study the effectiveness of auxiliary tasks for learning rich representations, focusing on the setting where the number of tasks and the size of the agent's network are increased simultaneously. For this purpose, we derive a new family of auxiliary tasks based on the successor measure. These tasks are easy to implement and have appealing theoretical properties. Combined with a suitable off-policy learning rule, the result is a representation learning algorithm that can be understood as extending the proto-value functions of Mahadevan & Maggioni (2007) to deep reinforcement learning; accordingly, we call the resulting object proto-value networks. Through a series of experiments on the Arcade Learning Environment, we demonstrate that proto-value networks produce rich features that may be used to obtain performance comparable to established algorithms, using only linear approximation and a small number (~4M) of interactions with the environment's reward function.
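To make the idea of successor-measure auxiliary tasks concrete, the following is a minimal tabular sketch, not the deep, off-policy algorithm the abstract describes. It assumes a known transition matrix under a fixed policy, the unnormalized definition of the successor measure (expected discounted number of visits to a set of states), and tasks given by indicators of random state subsets. The function name `successor_measure_tasks` and the toy transition matrix are hypothetical and chosen only for illustration.

```python
import numpy as np


def successor_measure_tasks(P, gamma, num_tasks, rng):
    """Tabular sketch: successor measure of randomly drawn state subsets.

    P is an (n, n) row-stochastic transition matrix under a fixed policy.
    Returns (masks, targets), where masks[k, s] is 1 if state s belongs to
    subset k, and targets[s, k] is the expected discounted number of visits
    to subset k when starting from state s.
    """
    n = P.shape[0]
    # Discounted state occupancy: sum_t gamma^t P^t = (I - gamma P)^{-1}.
    occupancy = np.linalg.inv(np.eye(n) - gamma * P)
    # Assumption: each auxiliary task is the indicator of a random subset.
    masks = rng.integers(0, 2, size=(num_tasks, n)).astype(float)
    # Summing occupancy over the states in each subset gives one target
    # column per task.
    targets = occupancy @ masks.T
    return masks, targets


# Toy 3-state chain (hypothetical), used only to exercise the function.
gamma = 0.9
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.9, 0.1],
    [0.1, 0.0, 0.9],
])
rng = np.random.default_rng(0)
masks, targets = successor_measure_tasks(P, gamma, num_tasks=4, rng=rng)
```

Because the subsets are drawn procedurally, `num_tasks` can be made as large as desired, which is the sense in which such tasks act as an essentially unlimited source of training signal. Each target column is the value function of a reward equal to the corresponding indicator, so it satisfies the usual Bellman equation for that reward.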