As Reinforcement Learning (RL) agents are increasingly employed in diverse decision-making problems using reward preferences, it becomes important to ensure that the policies these frameworks learn, which map observations to a probability distribution over the possible actions, are explainable. However, there is little to no work on systematically understanding these complex policies in a contrastive manner, i.e., on identifying what minimal changes to a policy would improve or worsen its performance to a desired level. In this work, we present COUNTERPOL, the first framework to analyze RL policies using counterfactual explanations in the form of minimal changes to the policy that lead to the desired outcome. We do so by bringing counterfactuals, as used in supervised learning, into RL, with the target outcome regulated by a desired return. We establish a theoretical connection between COUNTERPOL and widely used trust region-based policy optimization methods in RL. Extensive empirical analysis shows the efficacy of COUNTERPOL in generating explanations for (un)learning skills while staying close to the original policy. Our results on five different RL environments with diverse state and action spaces demonstrate the utility of counterfactual explanations, paving the way for new frontiers in designing and developing counterfactual policies.
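To make the idea of a "minimal change to the policy that reaches a desired return" concrete, the following is a minimal sketch on a three-armed bandit. It is not the COUNTERPOL objective itself: the loss form (squared gap to the target return plus a KL penalty to the original policy), the weight `k`, the learning rate, and all function names are assumptions chosen for illustration, and the KL term merely stands in for the trust-region flavour of proximity mentioned above.

```python
import math

def softmax(logits):
    """Turn logits into a probability distribution over actions."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_return(probs, rewards):
    """Expected one-step return of a stochastic policy on a bandit."""
    return sum(p * r for p, r in zip(probs, rewards))

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def counterfactual_policy(logits0, rewards, target_return,
                          k=0.1, lr=0.5, steps=2000):
    """Gradient descent on an ASSUMED loss (not taken from the paper):
        (J(theta) - target_return)^2 + k * KL(pi_0 || pi_theta)
    The first term pulls the return toward the target; the second keeps
    the new policy close to the original one.
    """
    pi0 = softmax(logits0)
    theta = list(logits0)
    for _ in range(steps):
        pi = softmax(theta)
        j = expected_return(pi, rewards)
        # Analytic gradients for a softmax policy:
        #   dJ/dtheta_a  = pi_a * (r_a - J)
        #   dKL/dtheta_a = pi_a - pi0_a
        grad = [
            2.0 * (j - target_return) * pi[a] * (rewards[a] - j)
            + k * (pi[a] - pi0[a])
            for a in range(len(theta))
        ]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return softmax(theta)

# Hypothetical example: uniform original policy, desired return of 0.8.
rewards = [1.0, 0.0, 0.5]
logits0 = [0.0, 0.0, 0.0]
pi0 = softmax(logits0)
pi_cf = counterfactual_policy(logits0, rewards, target_return=0.8)
print("original return:      ", expected_return(pi0, rewards))
print("counterfactual return:", expected_return(pi_cf, rewards))
print("KL(pi0 || pi_cf):     ", kl_divergence(pi0, pi_cf))
```

Raising `k` in this sketch keeps the counterfactual policy nearer to the original at the cost of a larger gap to the target return; lowering it does the opposite. Setting `target_return` below the original return would illustrate the "worsen" (unlearning) direction in the same way.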