Prior model-free RL algorithms have overlooked the fact that distinct primitive behaviors vary in significance over the course of policy learning. Leveraging this insight, we explore the causal relationship between individual action dimensions and rewards to evaluate the significance of various primitive behaviors during training. We introduce a causality-aware entropy term that identifies and prioritizes actions with high potential impact, enabling efficient exploration. Furthermore, to prevent excessive focus on specific primitive behaviors, we analyze the gradient dormancy phenomenon and introduce a dormancy-guided reset mechanism that further improves the efficacy of our method. Our proposed algorithm, ACE: Off-policy Actor-critic with Causality-aware Entropy regularization, demonstrates a substantial performance advantage over model-free RL baselines across 29 diverse continuous control tasks spanning 7 domains, which underscores the effectiveness, versatility, and sample efficiency of our approach. Benchmark results and videos are available at https://ace-rl.github.io/.
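To make the two mechanisms concrete, the sketch below illustrates one way they could be realized. It is not the implementation behind the reported results, and the abstract does not specify these formulas. It assumes a diagonal Gaussian policy, so that the entropy decomposes per action dimension and each dimension can be scaled by a causal weight. It also assumes that dormancy is measured as the fraction of units whose mean-normalized gradient norm falls at or below a threshold, and that the reset interpolates parameters toward a fresh initialization in proportion to that fraction. The names `causality_weighted_entropy`, `dormancy_ratio`, `soft_reset`, and the threshold `tau` are hypothetical.

```python
import math


def causality_weighted_entropy(log_stds, causal_weights):
    """Entropy of a diagonal Gaussian policy, weighted per action dimension.

    Assumption: dimension i contributes H_i = 0.5 * log(2*pi*e) + log_std_i,
    scaled by a causal weight w_i estimating that dimension's effect on reward.
    """
    if len(log_stds) != len(causal_weights):
        raise ValueError("need one causal weight per action dimension")
    const = 0.5 * math.log(2.0 * math.pi * math.e)
    return sum(w * (const + ls) for w, ls in zip(causal_weights, log_stds))


def dormancy_ratio(grad_norms, tau=0.025):
    """Fraction of units whose mean-normalized gradient norm is <= tau.

    Assumption: `grad_norms` holds one non-negative gradient magnitude per
    unit; `tau` is a hypothetical threshold.
    """
    mean = sum(grad_norms) / len(grad_norms)
    if mean == 0.0:
        return 1.0  # no unit receives any gradient signal
    dormant = sum(1 for g in grad_norms if g / mean <= tau)
    return dormant / len(grad_norms)


def soft_reset(params, fresh_params, ratio):
    """Interpolate parameters toward a fresh initialization.

    Assumption: the interpolation factor equals the dormancy ratio,
    clipped to [0, 1].
    """
    eta = min(max(ratio, 0.0), 1.0)
    return [(1.0 - eta) * p + eta * f for p, f in zip(params, fresh_params)]
```

With uniform weights the first function reduces to the ordinary Gaussian entropy, so under these assumptions a standard maximum-entropy objective is recovered as a special case. Larger weights on a dimension make its entropy count for more in the regularizer, which would encourage exploration along that dimension.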