Reinforcement learning (RL) algorithms have been applied successfully to numerous tasks, but their reliance on neural networks makes their behavior difficult to understand and trust. Counterfactual explanations are human-friendly explanations that give users actionable advice on how to alter a model's inputs to obtain a desired output from a black-box system. However, current approaches to generating counterfactuals in RL ignore the stochastic and sequential nature of RL tasks, and can therefore produce counterfactuals that are difficult to obtain or that do not deliver the desired outcome. In this work, we propose RACCER, the first RL-specific approach to generating counterfactual explanations for the behavior of RL agents. We first propose and implement a set of RL-specific counterfactual properties that ensure counterfactuals are easily reachable and lead to the desired outcome with high probability. We then use a heuristic tree search over the agent's execution trajectories to find the most suitable counterfactuals according to these properties. We evaluate RACCER on two tasks and conduct a user study, which shows that RL-specific counterfactuals help users understand agents' behavior better than current state-of-the-art approaches.
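To make the idea of searching execution trajectories for reachable, high-probability counterfactuals concrete, the following is a minimal sketch and not the RACCER implementation. Everything in it is an assumption made for illustration: the toy one-dimensional environment (`transitions`, `SLIP`), the stand-in agent (`policy`), the scoring rule (outcome probability minus a length penalty), and the function names. For simplicity it replaces the heuristic tree search with an exhaustive enumeration of short action sequences.

```python
from itertools import product

ACTIONS = ("left", "right")
SLIP = 0.2  # assumed probability that a move has no effect (toy stochasticity)


def transitions(state, action):
    """Toy stochastic model: return [(probability, next_state), ...]."""
    step = 1 if action == "right" else -1
    moved = min(max(state + step, 0), 6)  # positions are clamped to 0..6
    return [(1.0 - SLIP, moved), (SLIP, state)]


def policy(state):
    """Hypothetical stand-in for a trained agent's action choice."""
    return "jump" if state >= 4 else "right"


def outcome_probability(start, plan, desired_action):
    """Exact probability that the agent picks desired_action after executing plan."""
    dist = {start: 1.0}  # distribution over states, propagated step by step
    for action in plan:
        nxt = {}
        for state, p in dist.items():
            for q, s2 in transitions(state, action):
                nxt[s2] = nxt.get(s2, 0.0) + p * q
        dist = nxt
    return sum(p for s, p in dist.items() if policy(s) == desired_action)


def find_counterfactual(start, desired_action, max_depth=4, length_penalty=0.1):
    """Enumerate action sequences up to max_depth and return the best-scoring one.

    Score = probability of the desired outcome - length_penalty * plan length,
    so short plans (easy to reach) with likely outcomes are preferred.
    """
    best_plan, best_score = None, float("-inf")
    for depth in range(1, max_depth + 1):
        for plan in product(ACTIONS, repeat=depth):
            certainty = outcome_probability(start, plan, desired_action)
            score = certainty - length_penalty * depth
            if score > best_score:
                best_plan, best_score = plan, score
    return best_plan, best_score


if __name__ == "__main__":
    plan, score = find_counterfactual(start=2, desired_action="jump")
    print(plan, round(score, 3))
```

In this toy setting, two moves to the right are the shortest route to a state where the stand-in agent would choose `jump`, but each move can fail. Under the assumed penalty the search prefers three moves, which trades a slightly longer path for a much higher chance of reaching the desired outcome. A counterfactual method that ignores stochasticity would not see this trade-off.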