In reinforcement learning, optimism in the face of uncertainty (OFU) is a mainstream principle for directing exploration towards less explored areas, which are characterized by higher uncertainty. However, in the presence of environmental stochasticity (noise), purely optimistic exploration may lead to excessive probing of high-noise areas, which impedes exploration efficiency. Hence, when exploring noisy environments, optimism-driven exploration should remain the foundation, but it is also beneficial to curb unnecessary over-exploration of high-noise areas. In this work, we propose the Optimistic Value Distribution Explorer (OVD-Explorer) to achieve noise-aware optimistic exploration for continuous control. OVD-Explorer introduces a new measure of a policy's exploration ability that accounts for noise from an optimistic perspective, and leverages gradient ascent to drive exploration. In practice, OVD-Explorer can be easily integrated with continuous control RL algorithms. Extensive evaluations on MuJoCo and GridChaos tasks demonstrate the superiority of OVD-Explorer in achieving noise-aware optimistic exploration.