Conventional imitation learning assumes access to the demonstrators' actions, but these motor signals are often unobservable in naturalistic settings. Moreover, sequential decision-making behavior in these settings can deviate from the assumptions of a standard Markov Decision Process (MDP). To address these challenges, we explore deep generative modeling of state-only sequences with a non-Markov Decision Process (nMDP), in which the policy is an energy-based prior in the latent space of the state transition generator. We develop a maximum likelihood estimation procedure for model-based imitation, which involves short-run MCMC sampling from the prior and importance sampling for the posterior. The learned model enables "decision-making as inference": model-free policy execution is equivalent to prior sampling, and model-based planning is posterior sampling initialized from the policy. We demonstrate the efficacy of the proposed method on a prototypical path planning task with non-Markovian constraints and show that the learned model performs strongly in challenging domains from the MuJoCo suite.
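To make the idea of short-run MCMC sampling from an energy-based prior concrete, the following is a minimal sketch, not the authors' implementation. It assumes Langevin dynamics as the sampler (the abstract does not name one) and uses a hypothetical toy quadratic energy in place of a learned energy network. The names `grad_energy`, `short_run_langevin`, and `empirical_mean`, and all step-size and chain-length values, are illustrative assumptions.

```python
import random


def grad_energy(z, center):
    # Gradient of the toy energy U(z) = 0.5 * ||z - center||^2.
    # Assumption: this stands in for the gradient of a learned energy.
    return [zi - ci for zi, ci in zip(z, center)]


def short_run_langevin(center, n_steps=30, step_size=0.5, rng=None):
    # Run a fixed, small number of Langevin steps from a noise initialization.
    rng = rng if rng is not None else random.Random(0)
    # Initialize the latent vector from a standard normal distribution.
    z = [rng.gauss(0.0, 1.0) for _ in center]
    for _ in range(n_steps):
        grad = grad_energy(z, center)
        # Langevin update: gradient descent on the energy plus Gaussian noise.
        z = [
            zi - 0.5 * step_size ** 2 * gi + step_size * rng.gauss(0.0, 1.0)
            for zi, gi in zip(z, grad)
        ]
    return z


def empirical_mean(center, n_chains=2000, seed=0):
    # Average the final states of many independent short-run chains.
    rng = random.Random(seed)
    total = [0.0] * len(center)
    for _ in range(n_chains):
        z = short_run_langevin(center, rng=rng)
        total = [t + zi for t, zi in zip(total, z)]
    return [t / n_chains for t in total]


if __name__ == "__main__":
    # The sample mean should land near the low-energy region around `center`.
    print(empirical_mean([2.0, -1.0]))
```

Because the chain is short and the step size is finite, the samples only approximate the target density. In this toy setting the chain mean moves most of the way from the noise initialization toward the energy minimum within the 30 steps.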