Interpretability of reinforcement learning policies is essential for many real-world tasks, but learning such interpretable policies is a hard problem. In particular, rule-based policies such as decision trees and rule lists are difficult to optimize due to their non-differentiability. While existing techniques can learn verifiable decision tree policies, there is no guarantee that the learners generate a decision tree that performs optimally. In this work, we study the optimization of size-limited decision trees for Markov Decision Processes (MDPs) and propose OMDTs: Optimal MDP Decision Trees. Given a user-defined size limit and MDP formulation, OMDT directly maximizes the expected discounted return for the decision tree using Mixed-Integer Linear Programming. By training optimal decision tree policies for different MDPs, we empirically study the optimality gap for existing imitation learning techniques and find that they perform sub-optimally. We show that this is due to an inherent shortcoming of imitation learning, namely that complex policies cannot be represented using size-limited trees. In such cases, it is better to directly optimize the tree for expected return. While there is generally a trade-off between the performance and interpretability of machine learning models, we find that OMDTs limited to a depth of 3 often perform close to the optimal limit.
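To make the idea of directly optimizing a size-limited tree for expected discounted return concrete, the following is a minimal sketch under stated assumptions. It is not the Mixed-Integer Linear Programming formulation: it swaps in brute-force enumeration over depth-1 trees (decision stumps) on a hypothetical four-state toy MDP. The MDP, the discount factor, and all function names are illustrative assumptions, not taken from the paper.

```python
from itertools import product

# Hypothetical toy MDP (an assumption for illustration): states 0..3 on a line,
# with the state index as the only feature. Actions: 0 = left, 1 = right.
# Entering state 3 yields reward 1; state 3 is absorbing.
N_STATES, N_ACTIONS, GAMMA = 4, 2, 0.9


def step(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    if state == 3:  # absorbing goal state, no further reward
        return 3, 0.0
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, (1.0 if nxt == 3 else 0.0)


def stump_policy(threshold, left_action, right_action):
    """Depth-1 tree: if state <= threshold take left_action, else right_action."""
    return [left_action if s <= threshold else right_action
            for s in range(N_STATES)]


def evaluate(policy, sweeps=200):
    """Iterative policy evaluation; returns the discounted return from state 0."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        new_v = []
        for s in range(N_STATES):
            nxt, r = step(s, policy[s])
            new_v.append(r + GAMMA * v[nxt])
        v = new_v
    return v[0]


def best_stump():
    """Enumerate every stump and keep one with the highest return (brute force)."""
    candidates = product(range(N_STATES - 1), range(N_ACTIONS), range(N_ACTIONS))
    return max(candidates, key=lambda c: evaluate(stump_policy(*c)))


if __name__ == "__main__":
    best = best_stump()
    print(best, evaluate(stump_policy(*best)))
```

Enumeration is only feasible for very small trees and feature spaces, which is why an exact solver formulation is attractive for larger size limits. The sketch scores candidate trees by their return in the MDP itself rather than by how well they imitate another policy.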