In reinforcement learning (RL), the consideration of multivariate reward signals has led to fundamental advancements in multi-objective decision-making, transfer learning, and representation learning. This work introduces the first oracle-free and computationally tractable algorithms for provably convergent multivariate distributional dynamic programming and temporal difference learning. Our convergence rates match the familiar rates in the scalar reward setting, and additionally provide new insights into the fidelity of approximate return distribution representations as a function of the reward dimension. Surprisingly, when the reward dimension is larger than $1$, we show that the standard analysis of categorical TD learning fails, which we resolve with a novel projection onto the space of mass-$1$ signed measures. Finally, with the aid of our technical results and simulations, we identify tradeoffs between distribution representations that influence the performance of multivariate distributional RL in practice.
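To make the mass-$1$ projection concrete, the following is a minimal sketch, not the paper's exact operator: it assumes return distributions are represented by (possibly signed) weights over a fixed grid of support atoms, and performs the closed-form Euclidean projection of a weight vector onto the hyperplane of weights summing to $1$. The helper name `project_mass_one` and the example target are hypothetical.

```python
import numpy as np


def project_mass_one(weights: np.ndarray) -> np.ndarray:
    """Euclidean projection onto the space of mass-1 signed measures.

    A sketch under the assumption that a measure is represented by signed
    weights over n fixed support atoms. The orthogonal projection onto the
    hyperplane {w : sum(w) = 1} has the closed form of shifting every
    weight uniformly so the total mass equals exactly 1.
    """
    n = weights.size
    return weights + (1.0 - weights.sum()) / n


# Hypothetical usage: a categorical TD target in reward dimension d > 1 may
# leave the probability simplex; projecting onto mass-1 signed measures
# (rather than onto the simplex) restores unit total mass while permitting
# negative weights.
target = np.array([0.5, 0.4, 0.3, -0.1])  # signed weights, total mass 1.1
projected = project_mass_one(target)
assert np.isclose(projected.sum(), 1.0)
```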