We study computational and statistical aspects of learning Latent Markov Decision Processes (LMDPs). In this model, the learner interacts with an MDP drawn at the beginning of each epoch from an unknown mixture of MDPs. To sidestep known impossibility results, we consider several notions of separation between the constituent MDPs. Our main contribution is to establish a nearly-sharp *statistical threshold* for the horizon length necessary for efficient learning. On the computational side, we show that under a weaker assumption of separability under the optimal policy, there is a quasi-polynomial-time algorithm whose running time scales with the statistical threshold. We further show a near-matching time complexity lower bound under the Exponential Time Hypothesis.