Distributional reinforcement learning (RL) has proven useful in multiple benchmarks, as it enables approximating the full distribution of returns and makes better use of environment samples. The commonly used quantile regression approach to distributional RL -- based on asymmetric $L_1$ losses -- provides a flexible and effective way of learning arbitrary return distributions. In practice, it is often improved by using a more efficient, hybrid asymmetric $L_1$-$L_2$ Huber loss for quantile regression. However, by doing so, distributional estimation guarantees vanish, and we empirically observe that the estimated distribution rapidly collapses to its mean. Indeed, asymmetric $L_2$ losses, which correspond to expectile regression, cannot be readily used for distributional temporal-difference learning. Motivated by the efficiency of $L_2$-based learning, we propose to jointly learn expectiles and quantiles of the return distribution, allowing efficient learning while retaining an estimate of the full distribution of returns. We prove that our approach approximately learns the correct return distribution, and we benchmark a practical implementation on a toy example and at scale. On the Atari benchmark, our approach matches the performance of the Huber-based IQN-1 baseline after $200$M training frames, yet avoids distributional collapse and keeps estimates of the full distribution of returns.
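For concreteness, the three loss families the abstract contrasts have standard closed forms, stated here for reference (the notation is ours, not quoted from the paper). Writing $u$ for the temporal-difference residual, $\tau \in (0, 1)$ for the quantile or expectile level, and $\kappa > 0$ for the Huber threshold:
$$
\rho_\tau(u) \;=\; \bigl|\tau - \mathbb{1}\{u < 0\}\bigr|\,|u|,
\qquad
\rho^{\mathrm{E}}_\tau(u) \;=\; \bigl|\tau - \mathbb{1}\{u < 0\}\bigr|\,u^2,
\qquad
\rho^{\kappa}_\tau(u) \;=\; \bigl|\tau - \mathbb{1}\{u < 0\}\bigr|\,\frac{L_\kappa(u)}{\kappa},
$$
where $L_\kappa(u) = \frac{1}{2}u^2$ for $|u| \le \kappa$ and $L_\kappa(u) = \kappa\bigl(|u| - \frac{1}{2}\kappa\bigr)$ otherwise. The first is the asymmetric $L_1$ (quantile, or pinball) loss, the second the asymmetric $L_2$ (expectile) loss, and the third the hybrid Huber quantile loss in the $\kappa$-scaled form popularized by IQN. Note that for $|u| \le \kappa$ the Huber variant reduces to an asymmetric quadratic, matching the expectile loss up to scale; this quadratic region is what makes it efficient to optimize, but also what shifts its minimizer away from the true quantile, consistent with the vanishing guarantees the abstract refers to.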