The empirical success of distributional reinforcement learning~(RL) depends heavily on the choice of distribution representation and distribution divergence. In this paper, we propose \textit{Sinkhorn distributional RL~(SinkhornDRL)}, which learns unrestricted statistics from return distributions and leverages Sinkhorn divergence to minimize the difference between the current and target Bellman return distributions. Theoretically, we prove the contraction properties of SinkhornDRL. These properties are consistent with the interpolating nature of Sinkhorn divergence between the Wasserstein distance and Maximum Mean Discrepancy~(MMD). We also establish the equivalence between Sinkhorn divergence and a regularized MMD that exhibits regularized moment-matching behavior, which helps explain the superiority of SinkhornDRL. Empirically, we show that SinkhornDRL performs consistently better than or comparably to existing algorithms on the Atari games suite.
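The sketch below illustrates the core quantity: a Sinkhorn divergence between two sets of return samples (particles). It is not the SinkhornDRL implementation. It assumes uniform particle weights, a squared-distance cost $c(x,y)=(x-y)^2$, an entropic regularizer $\varepsilon\,\mathrm{KL}(\pi \mid a \otimes b)$, and the common debiased form $S_\varepsilon(\alpha,\beta)=\mathrm{OT}_\varepsilon(\alpha,\beta)-\tfrac12\mathrm{OT}_\varepsilon(\alpha,\alpha)-\tfrac12\mathrm{OT}_\varepsilon(\beta,\beta)$. The paper's exact cost, normalization, and solver may differ. Smaller values of `eps` push $S_\varepsilon$ toward the Wasserstein cost, and larger values push it toward an MMD-type discrepancy.

```python
import numpy as np


def _logsumexp(m, axis):
    # Numerically stable log(sum(exp(m))) along one axis.
    m_max = np.max(m, axis=axis, keepdims=True)
    out = np.log(np.sum(np.exp(m - m_max), axis=axis, keepdims=True)) + m_max
    return np.squeeze(out, axis=axis)


def entropic_ot(x, y, eps, n_iters=500):
    """Entropic OT cost between uniform empirical measures on 1-D samples x, y.

    Assumed setup: cost (x - y)^2, regularizer eps * KL(pi | a (x) b).
    Returns the dual value <a, f> + <b, g> after log-domain Sinkhorn updates.
    """
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    log_a, log_b = np.log(a), np.log(b)
    C = (x[:, None] - y[None, :]) ** 2  # pairwise cost matrix
    f = np.zeros(len(x))
    g = np.zeros(len(y))
    for _ in range(n_iters):
        # Alternate exact dual updates: each enforces one marginal constraint.
        f = -eps * _logsumexp(log_b[None, :] + (g[None, :] - C) / eps, axis=1)
        g = -eps * _logsumexp(log_a[:, None] + (f[:, None] - C) / eps, axis=0)
    return a @ f + b @ g


def sinkhorn_divergence(x, y, eps, n_iters=500):
    # Debiased form: subtracting the self-transport terms makes S(x, x) = 0.
    return (entropic_ot(x, y, eps, n_iters)
            - 0.5 * entropic_ot(x, x, eps, n_iters)
            - 0.5 * entropic_ot(y, y, eps, n_iters))


current = np.array([0.0, 1.0, 2.0])  # e.g. particles of the current return distribution
target = np.array([5.0, 6.0, 7.0])   # e.g. particles of the target r + gamma * Z'
print(sinkhorn_divergence(current, target, eps=1.0))
```

With the squared cost, a pure shift of the particles by $t$ gives $S_\varepsilon = t^2$ for every $\varepsilon$, so the call above returns approximately 25. In a learning setting the same computation would presumably be written in an autodiff framework, so that gradients flow into the current particles while the target particles are held fixed. That framing is an assumption about how such a loss would be used, not a description of the paper's code.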