While reinforcement learning (RL) algorithms have been successfully applied to numerous tasks, their reliance on neural networks makes their behavior difficult to understand and trust. Counterfactual explanations are human-friendly explanations that offer users actionable advice on how to alter the model inputs to achieve a desired output from a black-box system. However, current approaches to generating counterfactuals in RL ignore the stochastic and sequential nature of RL tasks and can produce counterfactuals that are difficult to obtain or do not deliver the desired outcome. In this work, we propose RACCER, the first RL-specific approach to generating counterfactual explanations for the behavior of RL agents. We first propose and implement a set of RL-specific counterfactual properties that ensure easily reachable counterfactuals with highly probable desired outcomes. We then use a heuristic tree search over the agent's execution trajectories to find the most suitable counterfactuals according to these properties. We evaluate RACCER on two tasks and conduct a user study, which shows that RL-specific counterfactuals help users understand an agent's behavior better than current state-of-the-art approaches.
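To make the search idea concrete, the following is a minimal Python sketch of finding a counterfactual state by searching over action sequences from the current state. It is an illustration under simplifying assumptions, not RACCER's implementation: the corridor environment, the `step` and `policy` functions, and `find_counterfactual` are all hypothetical names; transitions are deterministic (so stochasticity is ignored); and a plain best-first search ordered by the number of actions stands in for the heuristic tree search and for the RL-specific properties described above.

```python
import heapq
from itertools import count

# Toy 1-D corridor (assumption): states are integers 0..9.
ACTIONS = {"left": -1, "right": +1}


def step(state, action):
    """Hypothetical deterministic transition, clipped to the corridor."""
    return min(9, max(0, state + ACTIONS[action]))


def policy(state):
    """Hypothetical stand-in for a trained agent: moves toward state 7."""
    return "right" if state < 7 else "left"


def find_counterfactual(start, desired_action, max_depth=10):
    """Best-first search over action sequences starting at `start`.

    Returns (state, path) for the state reachable with the fewest actions
    in which the policy chooses `desired_action`, or None if no such state
    is found within `max_depth` actions.
    """
    tie = count()  # tie-breaker so heapq never has to compare paths
    frontier = [(0, next(tie), start, [])]
    seen = {start}
    while frontier:
        cost, _, state, path = heapq.heappop(frontier)
        # Skip the start state itself: a counterfactual must differ from it.
        if path and policy(state) == desired_action:
            return state, path
        if cost >= max_depth:
            continue
        for action in ACTIONS:
            nxt = step(state, action)
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(
                    frontier, (cost + 1, next(tie), nxt, path + [action])
                )
    return None
```

For example, `find_counterfactual(3, "left")` asks which nearby state would make this toy agent choose `left` instead of `right`, and returns that state together with the action sequence that reaches it. A fuller version would replace the action count with a cost that also reflects how likely the desired outcome is under stochastic transitions.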