As a key component of intuitive cognition and reasoning in human intelligence, causal knowledge offers great potential for improving the interpretability of reinforcement learning (RL) agents' decision-making by reducing the search space. However, a considerable gap remains in discovering causality and incorporating it into RL, which hinders the rapid development of causal RL. In this paper, we explicitly model the generation process of states with a causal graphical model, on the basis of which we augment the policy. We integrate causal structure updating into the RL interaction process through active intervention learning of the environment. To optimize the derived objective, we propose a framework with theoretical performance guarantees that alternates between two steps: using interventions for causal structure learning during exploration, and using the learned causal structure for policy guidance during exploitation. Due to the lack of public benchmarks that allow direct intervention in the state space, we design a root cause localization task in our simulated fault alarm environment and empirically demonstrate the effectiveness and robustness of the proposed method against state-of-the-art baselines. Theoretical analysis shows that the performance improvement is attributable to a virtuous cycle between causal-guided policy learning and causal structure learning, which aligns with our experimental results.
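The alternation described above — intervening on state variables to learn causal structure, then using that structure to shrink the policy's search space for root cause localization — can be illustrated with a minimal toy sketch. The four-variable graph, the intervention interface, and all function names here are hypothetical illustrations under simplified assumptions (a Boolean structural model with single-variable interventions), not the paper's actual environment or algorithm:

```python
# Toy ground-truth causal graph over 4 state variables (hypothetical example):
# TRUE_PARENTS[i] lists the direct causes of variable i.
TRUE_PARENTS = {0: [], 1: [0], 2: [0], 3: [1, 2]}


def intervene(var):
    """do(var := 1) on an all-zero state; return the variables that flip.

    Downstream variables are recomputed in topological order as the OR of
    their parents (a toy Boolean structural causal model).
    """
    values = {i: 0 for i in TRUE_PARENTS}
    values[var] = 1
    for i in sorted(TRUE_PARENTS):           # 0..3 is a topological order here
        if i != var and TRUE_PARENTS[i]:
            values[i] = int(any(values[p] for p in TRUE_PARENTS[i]))
    return {i for i, v in values.items() if v == 1 and i != var}


def learn_structure():
    """Exploration step: intervene on each variable and record which
    variables respond, i.e. an estimate of each variable's descendants."""
    return {var: intervene(var) for var in TRUE_PARENTS}


def guided_candidates(fault_var, descendants):
    """Exploitation step: restrict root-cause candidates to the variables
    whose intervention affects the faulty one (its causal ancestors),
    shrinking the search space the policy must explore."""
    return {v for v, d in descendants.items() if fault_var in d} | {fault_var}


descendants = learn_structure()
# A fault observed at variable 1 can only originate from {0, 1},
# not from all four variables.
print(guided_candidates(1, descendants))  # → {0, 1}
```

In this sketch the learned descendant sets play the role of the causal graphical model: once they are estimated from interventions, the "policy" never needs to consider non-ancestors of the observed fault, which is the search-space reduction the abstract refers to.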