Current Reinforcement Learning (RL) methods often suffer from sample inefficiency, which stems from blind exploration strategies that neglect the causal relationships among states, actions, and rewards. Although recent causal approaches aim to address this problem, they lack a grounded, reward-guided causal model of states and actions for goal-directed behavior, which impairs learning efficiency. To tackle this issue, we propose a novel method named Causal Information Prioritization (CIP) that improves sample efficiency by leveraging factored MDPs to infer the causal relationships between different dimensions of states and actions with respect to rewards, enabling the prioritization of causal information. Specifically, CIP identifies and exploits the causal relationships between states and rewards to perform counterfactual data augmentation, prioritizing high-impact state features under a causal understanding of the environment. Moreover, CIP integrates a causality-aware empowerment learning objective, which significantly strengthens the agent's execution of reward-guided actions for more efficient exploration in complex environments. To fully assess the effectiveness of CIP, we conduct extensive experiments across 39 tasks in 5 diverse continuous control environments, covering both locomotion and manipulation skill learning under pixel-based and sparse-reward settings. Experimental results demonstrate that CIP consistently outperforms existing RL methods across a wide range of scenarios.
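The counterfactual data augmentation step described above can be illustrated with a minimal sketch. Assuming a binary causal mask over state dimensions has already been inferred (marking which dimensions causally influence the reward), one simple way to generate counterfactual states is to swap the non-causal dimensions between randomly paired transitions while leaving the reward-relevant dimensions intact. The function name and interface below are hypothetical, not the paper's actual implementation:

```python
import numpy as np


def counterfactual_augment(states, causal_mask, rng=None):
    """Hedged sketch: mask-based counterfactual state augmentation.

    states: (N, d) array of state observations.
    causal_mask: boolean (d,) array; True marks dimensions inferred to
        causally affect the reward. These are kept intact, so the
        augmented states preserve the reward-relevant information.
    Non-causal dimensions are swapped between randomly paired samples,
    producing plausible counterfactual variants of each state.
    """
    rng = rng or np.random.default_rng(0)
    perm = rng.permutation(len(states))       # random pairing of samples
    augmented = states.copy()
    augmented[:, ~causal_mask] = states[perm][:, ~causal_mask]
    return augmented
```

In this simplified view, training on the augmented batch encourages the agent to rely on the causally relevant features rather than spurious ones; the actual mask inference (via factored MDPs) is beyond this sketch.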