In contrast to classical reinforcement learning, distributional reinforcement learning algorithms aim to learn the distribution of returns rather than their expected value. Since the nature of the return distribution is generally unknown a priori or arbitrarily complex, a common approach finds approximations within a set of representable, parametric distributions. Typically, this involves a projection of the unconstrained distribution onto the set of simplified distributions. We argue that this projection step entails a strong inductive bias when coupled with neural networks and gradient descent, thereby profoundly impacting the generalization behavior of learned models. In order to facilitate reliable uncertainty estimation through diversity, this work studies the combination of several different projections and representations in a distributional ensemble. We establish theoretical properties of such projection ensembles and derive an algorithm that uses ensemble disagreement, measured by the average $1$-Wasserstein distance, as a bonus for deep exploration. We evaluate our algorithm on the behavior suite benchmark and find that diverse projection ensembles lead to significant performance improvements over existing methods on a wide variety of tasks with the most pronounced gains in directed exploration problems.
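As a rough illustration of the disagreement measure mentioned above, the following sketch computes the average pairwise $1$-Wasserstein distance between ensemble members. It assumes each member is a discrete distribution over scalar returns, given as atom locations and probabilities, so that a categorical member (fixed atoms, learned probabilities) and a quantile member (learned atoms, uniform probabilities) can be compared directly. The names `wasserstein_1` and `ensemble_disagreement` are hypothetical. How the resulting value is scaled and propagated as an exploration bonus is not specified here, so the sketch leaves it out.

```python
import itertools
import numpy as np


def _cdf_on_grid(atoms, probs, grid):
    """Evaluate the CDF of a discrete distribution at each point of `grid`."""
    order = np.argsort(atoms)
    cumulative = np.concatenate([[0.0], np.cumsum(probs[order])])
    # Number of atoms <= x gives the index into the cumulative mass.
    counts = np.searchsorted(atoms[order], grid, side="right")
    return cumulative[counts]


def wasserstein_1(atoms_a, probs_a, atoms_b, probs_b):
    """1-Wasserstein distance between two discrete distributions on the real line.

    Uses the identity W1 = integral of |F_a(x) - F_b(x)| dx, which holds for
    distributions over scalars.
    """
    atoms_a = np.asarray(atoms_a, dtype=float)
    probs_a = np.asarray(probs_a, dtype=float)
    atoms_b = np.asarray(atoms_b, dtype=float)
    probs_b = np.asarray(probs_b, dtype=float)

    # Both CDFs are piecewise constant between consecutive points of the
    # merged support, so the integral reduces to a finite sum.
    support = np.sort(np.concatenate([atoms_a, atoms_b]))
    widths = np.diff(support)
    left_edges = support[:-1]

    cdf_a = _cdf_on_grid(atoms_a, probs_a, left_edges)
    cdf_b = _cdf_on_grid(atoms_b, probs_b, left_edges)
    return float(np.sum(np.abs(cdf_a - cdf_b) * widths))


def ensemble_disagreement(members):
    """Average 1-Wasserstein distance over all pairs of ensemble members.

    `members` is a list of (atoms, probs) tuples, one per ensemble member
    (an assumed interface for this sketch). At least two members are needed.
    """
    pairs = list(itertools.combinations(members, 2))
    distances = [wasserstein_1(*a, *b) for a, b in pairs]
    return float(np.mean(distances))
```

For example, `ensemble_disagreement([(c_atoms, c_probs), (q_atoms, q_probs)])` would compare one categorical and one quantile prediction for the same state-action pair. Because the distance is symmetric, averaging over unordered pairs gives the same value as averaging over all ordered pairs of distinct members.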