Causal State Distillation for Explainable Reinforcement Learning

Reinforcement learning (RL) is a powerful technique for training intelligent agents, but understanding why these agents make specific decisions remains challenging. This lack of transparency is a long-standing problem that makes it difficult for users to grasp the reasons behind an agent's behaviour. Among the approaches explored to address it, reward decomposition (RD) is a promising avenue. RD is appealing because it sidesteps some of the concerns associated with methods that attempt to rationalize an agent's behaviour post hoc: it works by exposing the facets of the reward that contribute to the agent's objectives during training. However, RD alone is limited, as it primarily offers insights in terms of sub-rewards and does not examine the cause-and-effect relationships within an RL agent's neural model. In this paper, we present an extension of RD that goes beyond sub-rewards to provide more informative explanations. Our approach is centred on a causal learning framework whose explanation objectives use information-theoretic measures to encourage three crucial properties of causal factors: \emph{causal sufficiency}, \emph{sparseness}, and \emph{orthogonality}. These properties help us distill the cause-and-effect relationships between the agent's states and its actions or rewards, allowing for a deeper understanding of its decision-making processes. Our framework is designed to generate local explanations and can be applied to a wide range of RL tasks with multiple reward channels. Through a series of experiments, we demonstrate that our approach offers more meaningful and insightful explanations for the agent's action selections.
