Policies trained via reinforcement learning (RL) are often very complex, even for simple tasks. In an episode with n time steps, a policy makes n decisions on which action to take, many of which may appear non-intuitive to an observer. Moreover, it is not clear which of these decisions directly contribute towards achieving the reward, or how significant their contribution is. Given a trained policy, we propose a black-box method based on statistical covariance estimation that clusters the states of the environment and ranks each cluster according to the importance of the decisions made in its states. We compare our measure against a previous ranking procedure based on statistical fault localization.