While distributional reinforcement learning (RL) has demonstrated empirical success, the question of when and why it is beneficial has remained unanswered. In this work, we provide one explanation for the benefits of distributional RL through the lens of small-loss bounds, which scale with the instance-dependent optimal cost. If the optimal cost is small, our bounds are stronger than those from non-distributional approaches. As a warm-up, we show that learning the cost distribution leads to small-loss regret bounds in contextual bandits (CB), and we find that distributional CB empirically outperforms the state of the art on three challenging tasks. For online RL, we propose a distributional version-space algorithm that constructs confidence sets using maximum likelihood estimation, and we prove that it achieves small-loss regret in tabular MDPs and enjoys small-loss PAC bounds in latent variable models. Building on similar insights, we propose a distributional offline RL algorithm based on the pessimism principle and prove that it enjoys small-loss PAC bounds, which exhibit a novel robustness property. For both online and offline RL, our results establish the first theoretical benefits of learning distributions even when only the mean is needed for making decisions.
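
To make the phrase "small-loss bound" concrete, the following is a schematic comparison only, not a statement of this work's theorems. The symbols are assumptions introduced here for illustration: $K$ is the number of rounds or episodes, $V^\star \in [0, 1]$ is the normalized optimal expected cost, and $d$ stands in for whatever problem-dependent complexity factors the actual bounds carry.

```latex
% Schematic only: K, V^\star, and d are illustrative placeholders,
% not the exact quantities or rates proved in the paper.
\begin{align*}
  \text{worst-case (non-adaptive) regret:} \quad
    & \mathrm{Regret}(K) \le \widetilde{O}\!\left(\sqrt{d\,K}\right), \\
  \text{small-loss (first-order) regret:} \quad
    & \mathrm{Regret}(K) \le \widetilde{O}\!\left(\sqrt{d\,V^\star K} + d\right).
\end{align*}
% Since V^\star \le 1, the second bound is never worse up to the additive
% lower-order term, and when V^\star \approx 0 it no longer grows as \sqrt{K}.
```

Under these assumptions, the small-loss form matches the worst-case rate when $V^\star$ is of constant order and improves on it as the optimal cost shrinks, which is the sense in which the bounds above are "stronger" when the optimal cost is small.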