Counterfactual explanations are a common tool for explaining artificial intelligence models. For Reinforcement Learning (RL) agents, they answer "Why not?" or "What if?" questions by illustrating the minimal change to a state that would lead the agent to choose a different action. Generating counterfactual explanations for RL agents with visual input is especially challenging because of their large state spaces and because their decisions are part of an overarching policy that includes long-term decision-making. However, research on counterfactual explanations specifically for RL agents with visual input is scarce and does not go beyond identifying defective agents. It is unclear whether counterfactual explanations are still helpful for more complex tasks, such as analyzing the learned strategies of different agents or choosing a fitting agent for a specific task. We propose a novel but simple method to generate counterfactual explanations for RL agents by formulating the problem as a domain transfer problem, which allows the use of adversarial learning techniques like StarGAN. Our method is fully model-agnostic, and we demonstrate that it outperforms the only previous method on several computational metrics. Furthermore, we show in a user study that our method performs best when analyzing which strategies different agents pursue.