While Distributional Reinforcement Learning (DRL) methods have demonstrated strong performance in online settings, their success in offline scenarios remains limited. We hypothesize that a key limitation of existing offline DRL methods is that they underestimate return quantiles uniformly. This uniform pessimism can lead to overly conservative value estimates, ultimately hindering generalization and performance. To address this, we introduce a novel concept called quantile distortion, which enables non-uniform pessimism by adjusting the degree of conservatism according to the availability of supporting data. Our approach is grounded in theoretical analysis and validated empirically, demonstrating improved performance over uniform pessimism.
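To make the idea of non-uniform pessimism concrete, the following is a minimal sketch, not the paper's actual formulation: it assumes a hypothetical scalar `support` in [0, 1] measuring how well a state-action pair is covered by the offline dataset, and an illustrative power-law distortion g(τ) = 1 − (1 − τ)^η whose strength η grows as support decreases, so poorly supported actions are valued more pessimistically while well-supported ones are barely distorted.

```python
import numpy as np


def distorted_quantile_value(quantiles, taus, support, beta=1.0):
    """Pessimistically aggregate a return distribution, with the degree of
    pessimism scaled by how well the (state, action) is supported by the data.

    quantiles : array of shape (N,), estimated return quantiles (ascending).
    taus      : array of shape (N,), quantile fractions in (0, 1).
    support   : scalar in [0, 1]; 1 = well covered by the dataset,
                0 = out of distribution (hypothetical density-based estimate).
    beta      : maximum distortion strength when support is 0 (assumed knob).
    """
    # Distortion exponent: support -> 1 gives eta -> 1 (no distortion, i.e. an
    # ordinary mean over quantiles); support -> 0 gives eta -> 1 + beta, which
    # shifts weight toward the lower quantiles (stronger pessimism).
    eta = 1.0 + beta * (1.0 - support)

    # Derivative of the distortion g(tau) = 1 - (1 - tau)^eta gives per-quantile
    # weights that emphasize low quantiles more strongly as eta grows.
    weights = eta * (1.0 - taus) ** (eta - 1.0)
    weights /= weights.sum()

    # Distorted expectation: a support-dependent pessimistic value estimate.
    return float(np.dot(weights, quantiles))


# Usage: the same return distribution is valued less pessimistically when the
# action is well supported by the offline dataset.
taus = (np.arange(32) + 0.5) / 32
quantiles = np.sort(np.random.default_rng(0).normal(loc=1.0, scale=0.5, size=32))
print(distorted_quantile_value(quantiles, taus, support=0.9))  # mild pessimism
print(distorted_quantile_value(quantiles, taus, support=0.1))  # strong pessimism
```

Under this sketch, uniform pessimism corresponds to applying the same distortion (or constant penalty) everywhere; quantile distortion instead lets the weighting interpolate between a plain mean over quantiles on well-supported data and a low-quantile-focused estimate on poorly supported data.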