The theoretical advantages of distributional reinforcement learning (RL) over classical RL remain elusive despite its remarkable empirical performance. Starting from Categorical Distributional RL (CDRL), we apply a return density function decomposition technique and attribute the potential superiority of distributional RL to the distribution-matching regularization that this decomposition yields. This regularization, previously unexplored in the distributional RL context, captures information about the return distribution beyond its expectation alone, and thereby contributes an augmented reward signal to policy optimization. In contrast to the entropy regularization in MaxEnt RL, which explicitly optimizes the policy to encourage exploration, the regularization in CDRL implicitly optimizes the policy: guided by the new reward signal, the policy aligns with the uncertainty of the target return distributions, which leads to an uncertainty-aware exploration effect. Finally, extensive experiments substantiate the importance of this uncertainty-aware regularization to the empirical benefits of distributional RL over classical RL.