Deep Reinforcement Learning (DRL) has achieved remarkable success in sequential decision-making problems. However, existing DRL agents make decisions in an opaque fashion, which prevents users from establishing trust in the agents and scrutinizing their weaknesses. While recent research has developed Interpretable Policy Extraction (IPE) methods for explaining how an agent takes actions, their explanations are often inconsistent with the agent's behavior and thus frequently fail to explain it. To tackle this issue, we propose a novel method, Fidelity-Induced Policy Extraction (FIPE). Specifically, we start by analyzing the optimization mechanism of existing IPE methods and elaborate on how they ignore consistency while increasing cumulative rewards. We then design a fidelity-induced mechanism by integrating a fidelity measurement into the reinforcement learning feedback. We conduct experiments in the complex control environment of StarCraft II, a setting that current IPE methods typically avoid. The experimental results demonstrate that FIPE outperforms the baselines in terms of interaction performance and consistency, while remaining easy to understand.
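To make the idea of integrating a fidelity measurement into the reinforcement learning feedback concrete, the sketch below shows one possible reading. The abstract does not give the exact formulation, so everything here is an assumption for illustration: fidelity is taken to be the fraction of states in which the extracted policy (the "student") chooses the same action as the DRL agent (the "teacher"), and the feedback is the environment reward plus a weighted agreement bonus. The names `fidelity`, `fidelity_induced_reward`, and `weight` are hypothetical and are not taken from FIPE.

```python
from typing import Sequence


def fidelity(student_actions: Sequence[int], teacher_actions: Sequence[int]) -> float:
    """Fraction of states where the student picks the teacher's action.

    Assumption: discrete actions and action-agreement as the fidelity measure.
    """
    if len(student_actions) != len(teacher_actions):
        raise ValueError("action sequences must have equal length")
    if not student_actions:
        return 0.0
    matches = sum(s == t for s, t in zip(student_actions, teacher_actions))
    return matches / len(student_actions)


def fidelity_induced_reward(
    env_reward: float,
    student_action: int,
    teacher_action: int,
    weight: float = 1.0,
) -> float:
    """Environment reward plus a bonus when the student agrees with the teacher.

    Assumption: an additive bonus scaled by a hypothetical `weight`; the
    actual FIPE feedback may combine the two signals differently.
    """
    agreement = 1.0 if student_action == teacher_action else 0.0
    return env_reward + weight * agreement
```

With a feedback signal of this shape, a policy that only maximizes environment reward gains nothing extra for diverging from the agent, while one that matches the agent's actions is rewarded for consistency; `weight` would control the trade-off between the two.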