We introduce off-policy distributional Q($\lambda$), a new addition to the family of off-policy distributional evaluation algorithms. Off-policy distributional Q($\lambda$) does not apply importance sampling for off-policy learning, which introduces intriguing interactions with signed measures. These unique properties distinguish distributional Q($\lambda$) from existing alternatives such as distributional Retrace. We characterize the algorithmic properties of distributional Q($\lambda$) and validate the theoretical insights with tabular experiments. We also show that distributional Q($\lambda$)-C51, a combination of Q($\lambda$) with the C51 agent, exhibits promising results on deep RL benchmarks.