Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees

From arXiv. The conference version of this paper is "Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees" (NeurIPS 2022).

We consider the task of estimating a structural model of dynamic decisions by a human agent based upon the observable history of implemented actions and visited states. This problem has an inherent nested structure: in the inner problem, an optimal policy for a given reward function is identified, while in the outer problem, a measure of fit is maximized. Several approaches have been proposed to alleviate the computational burden of this nested-loop structure, but these methods still suffer from high complexity when the state space is either discrete with large cardinality or continuous in high dimensions. Other approaches in the inverse reinforcement learning (IRL) literature emphasize policy estimation at the expense of reduced reward estimation accuracy. In this paper we propose a single-loop estimation algorithm with finite-time guarantees that is equipped to deal with high-dimensional state spaces without compromising reward estimation accuracy. In the proposed algorithm, each policy improvement step is followed by a stochastic gradient step for likelihood maximization. We show that the proposed algorithm converges to a stationary solution with a finite-time guarantee. Further, if the reward is parameterized linearly, we show that the algorithm approximates the maximum likelihood estimator at a sublinear rate. Finally, by using robotics control problems in MuJoCo and their transfer settings, we show that the proposed algorithm achieves superior performance compared with other IRL and imitation learning benchmarks.
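To make the single-loop idea concrete, here is a minimal sketch of one way the alternation described in the abstract (a policy improvement step followed by a stochastic gradient step on the likelihood) might look on a tiny tabular MDP. It is not the authors' implementation. Everything here is an assumption made for illustration: the entropy-regularized (soft) policy model, the one-hot linear reward features (so the parameter `theta` is simply a reward table), the fixed number of evaluation sweeps, and all function names and hyperparameters.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def soft_policy_eval(P, r, pi, gamma, sweeps=50):
    """Approximate soft Q-values of a fixed policy pi under reward r.
    P has shape (S, A, S); r and pi have shape (S, A)."""
    Q = np.zeros_like(r)
    for _ in range(sweeps):
        # Soft state value: expected Q plus the policy's entropy bonus.
        log_pi = np.log(np.clip(pi, 1e-12, None))
        V = np.sum(pi * (Q - log_pi), axis=1)
        Q = r + gamma * (P @ V)
    return Q

def discounted_features(P, pi, gamma, horizon, rng, s0=0):
    """Sample one trajectory and return its discounted one-hot feature counts."""
    S, A = pi.shape
    mu = np.zeros((S, A))
    s = s0
    for t in range(horizon):
        a = rng.choice(A, p=pi[s])
        mu[s, a] += gamma ** t
        s = rng.choice(S, p=P[s, a])
    return mu

def ml_irl_sketch(P, expert_pi, gamma=0.9, iters=200, lr=0.1, horizon=30, seed=0):
    """Hypothetical single-loop estimator: one policy update per reward update."""
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    theta = np.zeros((S, A))          # reward parameters (assumed one-hot features)
    pi = np.full((S, A), 1.0 / A)     # start from the uniform policy
    for _ in range(iters):
        # Policy improvement: evaluate the current policy, then take a softmax step.
        Q = soft_policy_eval(P, theta, pi, gamma)
        pi = softmax(Q)
        # Stochastic gradient of the log-likelihood for a linear reward:
        # expert feature counts minus the current policy's feature counts.
        mu_expert = discounted_features(P, expert_pi, gamma, horizon, rng)
        mu_agent = discounted_features(P, pi, gamma, horizon, rng)
        theta += lr * (mu_expert - mu_agent)
    return theta, pi

# Illustrative setup: a random 4-state, 2-action MDP with a synthetic "expert".
rng = np.random.default_rng(1)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel, shape (S, A, S)
true_r = rng.normal(size=(S, A))
expert_pi = np.full((S, A), 1.0 / A)
for _ in range(50):                          # soft policy iteration on the true reward
    expert_pi = softmax(soft_policy_eval(P, true_r, expert_pi, 0.9))

theta, pi = ml_irl_sketch(P, expert_pi)
```

The point of the sketch is the loop body: the policy is never solved to optimality for the current reward estimate before the reward is updated, which is what distinguishes a single-loop scheme from the nested inner/outer structure. In this toy setting the expert trajectories are freshly sampled rather than drawn from a fixed dataset, and the recovered `theta` should not be expected to match `true_r` entry by entry, since rewards are generally identified only up to transformations that leave the policy unchanged.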
