The majority of off-policy reinforcement learning algorithms use techniques to control overestimation bias. Most of these techniques are rooted in heuristics and primarily address the consequences of overestimation rather than its fundamental origins. In this work, we present a novel approach to bias correction, similar in spirit to Double Q-Learning. We propose using a policy in the form of a mixture with two components. Each policy component is maximized and assessed by separate networks, which removes any basis for the overestimation bias. Our approach shows promising near-SOTA results on a small set of MuJoCo environments.
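To make the source of the bias concrete, the following is a minimal numerical sketch of the general Double Q-Learning idea that the abstract refers to, not the mixture-policy method proposed in this work. It assumes a toy setting in which every action has a true value of zero and the value estimates are corrupted by independent Gaussian noise; the function names (`single_estimator`, `double_estimator`, `simulate`) and all parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def single_estimator(q_noisy):
    # Select and evaluate with the same noisy estimates:
    # the max over noisy values is biased upward.
    return q_noisy.max(axis=1)


def double_estimator(q_select, q_eval):
    # Select the action with one set of estimates and
    # evaluate it with an independent set (Double Q-Learning style).
    idx = q_select.argmax(axis=1)
    return q_eval[np.arange(len(idx)), idx]


def simulate(n_actions=10, n_trials=20000, noise_std=1.0, seed=0):
    # Assumption: all true action values are 0, so the true maximum is 0
    # and any nonzero average below is pure estimation bias.
    rng = np.random.default_rng(seed)
    q_a = rng.normal(0.0, noise_std, size=(n_trials, n_actions))
    q_b = rng.normal(0.0, noise_std, size=(n_trials, n_actions))
    single_bias = single_estimator(q_a).mean()
    double_bias = double_estimator(q_a, q_b).mean()
    return single_bias, double_bias


if __name__ == "__main__":
    single_bias, double_bias = simulate()
    print(f"single-estimator bias: {single_bias:.3f}")
    print(f"double-estimator bias: {double_bias:.3f}")
```

In this toy setting the single estimator reports a clearly positive average even though every true value is zero, while the decoupled estimator stays close to zero. This is the effect that separating maximization from assessment is meant to achieve.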