While distributional reinforcement learning (DistRL) has been empirically effective, the question of when and why it outperforms vanilla, non-distributional RL has remained unanswered. This paper explains the benefits of DistRL through the lens of small-loss bounds, which are instance-dependent bounds that scale with the optimal achievable cost. In particular, our bounds converge much faster than those from non-distributional approaches when the optimal cost is small. As a warm-up, we propose a distributional contextual bandit (DistCB) algorithm, which we show enjoys small-loss regret bounds and empirically outperforms the state of the art on three real-world tasks. In online RL, we propose a DistRL algorithm that constructs confidence sets using maximum likelihood estimation. We prove that our algorithm enjoys novel small-loss PAC bounds in low-rank MDPs. As part of our analysis, we introduce the $\ell_1$ distributional eluder dimension, which may be of independent interest. Finally, in offline RL, we show that pessimistic DistRL enjoys small-loss PAC bounds that are novel to the offline setting and are more robust to bad single-policy coverage.