Careful at Estimation and Bold at Exploration

Exploration strategies in continuous action spaces are often heuristic because the action set is infinite, and such methods do not yield general conclusions. Prior work has shown that policy-based exploration benefits continuous action spaces in deterministic policy reinforcement learning (DPRL). However, policy-based exploration in DPRL has two prominent issues, aimless exploration and policy divergence, and the policy gradient used for exploration is only sometimes helpful because of inaccurate estimation. Building on the double-Q function framework, we introduce a novel exploration strategy, separate from the policy gradient, to mitigate these issues. We first propose the greedy Q softmax update schema for the Q value update. The expected Q value is obtained as a weighted sum of the conservative Q value over actions, where each weight is the corresponding greedy Q value. Greedy Q takes the maximum of the two Q functions, and conservative Q takes the minimum. For practical use, we then extend this theoretical basis so that action exploration can be combined with the Q value update, under the premise that we have a surrogate policy that behaves like this exploration policy. In practice, we construct the exploration policy from a few sampled actions, and to meet the premise, we learn the surrogate policy by minimizing the KL divergence between the target policy and the exploration policy constructed by the conservative Q. We evaluate our method on the MuJoCo benchmark and demonstrate superior performance compared to previous state-of-the-art methods across various environments, particularly in the most complex Humanoid environment.
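The weighted-sum update described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the exact formula, so the sketch assumes the weights are a softmax of the greedy Q values with a temperature parameter, and that the result feeds a standard one-step TD target. The function names, the `temperature` and `gamma` parameters, and the example numbers are all hypothetical.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_q_softmax_value(q1, q2, temperature=1.0):
    """Expected Q value over a few sampled actions.

    q1 and q2 hold the two critics' estimates for the same sampled actions.
    Assumption: weights are softmax(greedy Q); the abstract only says the
    weight "is the corresponding greedy Q value".
    """
    if not q1 or len(q1) != len(q2):
        raise ValueError("q1 and q2 must be non-empty and of equal length")
    greedy = [max(a, b) for a, b in zip(q1, q2)]        # optimistic estimate
    conservative = [min(a, b) for a, b in zip(q1, q2)]  # pessimistic estimate
    weights = softmax(greedy, temperature)
    return sum(w * c for w, c in zip(weights, conservative))


def td_target(reward, done, q1_next, q2_next, gamma=0.99, temperature=1.0):
    """One-step TD target built from the expected value above (assumed form)."""
    next_value = greedy_q_softmax_value(q1_next, q2_next, temperature)
    return reward + gamma * (0.0 if done else 1.0) * next_value


if __name__ == "__main__":
    # Hypothetical critic outputs for three sampled next actions.
    q1_next = [1.0, 2.5, 0.3]
    q2_next = [1.4, 2.0, 0.9]
    print(td_target(reward=0.5, done=False, q1_next=q1_next, q2_next=q2_next))
```

Under these assumptions, actions that look promising to either critic receive more weight, while the value that is actually propagated stays pessimistic. That is one reading of being bold at exploration and careful at estimation.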
