While distributional reinforcement learning (RL) has demonstrated empirical success, the question of when and why it is beneficial has remained unanswered. In this work, we provide one explanation for the benefits of distributional RL through the lens of small-loss bounds, which scale with the instance-dependent optimal cost. If the optimal cost is small, our bounds are stronger than those from non-distributional approaches. As a warmup, we show that learning the cost distribution leads to small-loss regret bounds in contextual bandits (CB), and we find that distributional CB empirically outperforms the state of the art on three challenging tasks. For online RL, we propose a distributional version-space algorithm that constructs confidence sets using maximum likelihood estimation, and we prove that it achieves small-loss regret in tabular MDPs and enjoys small-loss PAC bounds in latent variable models. Building on similar insights, we propose a distributional offline RL algorithm based on the pessimism principle and prove that it enjoys small-loss PAC bounds, which exhibit a novel robustness property. For both online and offline RL, our results establish the first theoretical benefits of learning distributions even when only the mean is needed for making decisions.