Causal State Distillation for Explainable Reinforcement Learning

Reinforcement learning (RL) is a powerful technique for training intelligent agents, but understanding why these agents make specific decisions is challenging. This lack of transparency is a long-standing problem that makes it difficult for users to grasp the reasons behind an agent's behaviour. Among the approaches explored to address it, reward decomposition (RD) is a promising avenue. RD is appealing because it sidesteps some of the concerns associated with methods that rationalize an agent's behaviour post hoc: it exposes the facets of the reward that contribute to the agent's objectives during training. However, RD alone is limited, as it primarily offers insights based on sub-rewards and does not examine the cause-and-effect relationships within an RL agent's neural model. In this paper, we present an extension of RD that goes beyond sub-rewards to provide more informative explanations. Our approach is centred on a causal learning framework that uses information-theoretic measures as explanation objectives, encouraging three properties of causal factors: causal sufficiency, sparseness, and orthogonality. These properties help us distill the cause-and-effect relationships between the agent's states and its actions or rewards, allowing a deeper understanding of its decision-making. Our framework generates local explanations and can be applied to a wide range of RL tasks with multiple reward channels. Through a series of experiments, we demonstrate that our approach offers more meaningful and insightful explanations for the agent's action selections.
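
To make the sub-reward explanations mentioned above concrete, the following is a minimal sketch of plain reward decomposition, not of the causal framework proposed in this paper. It assumes the agent keeps one Q-value per reward channel and per action, selects the action that maximizes the sum over channels, and explains a choice by the per-channel difference between two actions. The function names, channel names, and numbers are hypothetical and chosen only for illustration.

```python
def total_q(q_components):
    """Sum per-channel Q-values into one overall Q-value per action.

    q_components maps a (hypothetical) reward-channel name to a list
    holding that channel's Q-value for each action.
    """
    return [sum(per_action) for per_action in zip(*q_components.values())]


def greedy_action(q_components):
    """Return the index of the action with the highest summed Q-value."""
    totals = total_q(q_components)
    return max(range(len(totals)), key=totals.__getitem__)


def channel_differences(q_components, chosen, alternative):
    """Per-channel Q-value advantage of `chosen` over `alternative`.

    A positive entry means that channel favours the chosen action;
    a negative entry means it favours the alternative.
    """
    return {
        name: values[chosen] - values[alternative]
        for name, values in q_components.items()
    }


# Illustrative values only: two reward channels, three actions.
q_components = {
    "progress": [1.0, 2.0, 0.5],
    "safety": [0.5, -2.0, 1.5],
}

best = greedy_action(q_components)
explanation = channel_differences(q_components, chosen=best, alternative=1)
print(best, explanation)
```

In this toy setting the comparison shows which channel outweighs the other when the agent prefers one action over another. That is the kind of sub-reward insight RD offers; it says nothing about which parts of the state caused those values, which is the gap the abstract describes.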
