Deep deterministic off-policy algorithms have proven effective at solving challenging continuous control problems. Current approaches commonly use random noise for exploration, which has several drawbacks, including the need for manual tuning on each task and the absence of any calibration of exploration during training. We address these challenges by proposing a novel guided exploration method that uses an ensemble of Monte Carlo Critics to compute an exploratory action correction. The proposed method improves on the traditional exploration scheme by adjusting exploration dynamically. Building on this, we present a novel algorithm that leverages the proposed exploratory module to modify both the policy and the critic. The presented algorithm outperforms modern reinforcement learning algorithms across a variety of problems in the DMControl suite.
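The abstract does not specify how the exploratory action correction is computed. Below is a minimal, hypothetical sketch of one plausible reading: the correction follows the gradient of the ensemble-mean value with respect to the action, scaled by ensemble disagreement so that exploration shrinks as the critics converge. All names here (`CriticEnsemble`, `guided_action`, `alpha`) are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of guided exploration via an ensemble of Monte Carlo
# critics; the paper's exact formulation may differ.
import torch
import torch.nn as nn


class CriticEnsemble(nn.Module):
    """Ensemble of small Q-networks, each trained on Monte Carlo returns."""

    def __init__(self, state_dim: int, action_dim: int, n_critics: int = 5):
        super().__init__()
        self.critics = nn.ModuleList(
            nn.Sequential(
                nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                nn.Linear(256, 1),
            )
            for _ in range(n_critics)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        x = torch.cat([state, action], dim=-1)
        # Shape: (n_critics, batch, 1)
        return torch.stack([c(x) for c in self.critics])


def guided_action(policy, ensemble, state, alpha: float = 0.1):
    """Correct the deterministic policy action using the critic ensemble.

    The correction direction is the gradient of the ensemble-mean value
    w.r.t. the action; its magnitude is scaled by the ensemble's standard
    deviation, an assumed proxy for epistemic uncertainty, so exploration
    decays as the critics come to agree.
    """
    action = policy(state).detach().requires_grad_(True)
    q = ensemble(state, action)           # (n_critics, batch, 1)
    q.mean().backward()
    disagreement = q.std(dim=0).detach()  # per-sample disagreement signal
    correction = alpha * disagreement * action.grad
    return (action + correction).detach().clamp(-1.0, 1.0)
```

Under this reading, the same module naturally calibrates exploration over training: early on, critic disagreement is large and the correction dominates; as Monte Carlo estimates converge, the correction vanishes and the policy acts near-deterministically.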