We investigate the problems of model estimation and reward-free learning in episodic Block MDPs. In these MDPs, the decision maker observes rich observations, or contexts, generated from a small number of latent states. We first consider estimating the latent state decoding function (the mapping from observations to latent states) from data generated under a fixed behavior policy. We derive an information-theoretic lower bound on the error rate for estimating this function and present an algorithm that approaches this fundamental limit. Our algorithm also provides estimates of all the components of the MDP. We then study the problem of learning near-optimal policies in the reward-free framework. Building on our efficient model estimation algorithm, we show that we can infer a policy that converges to the optimal policy at the best possible rate as the number of collected samples grows large. Interestingly, our analysis provides necessary and sufficient conditions under which exploiting the block structure improves the sample complexity of identifying near-optimal policies. When these conditions are met, the sample complexity in the minimax reward-free setting is improved by a multiplicative factor $n$, where $n$ is the number of possible contexts.
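To make the setting concrete, the following is a minimal sketch of a toy Block MDP and of a decoding error count. It is an illustration only, not the estimation algorithm of the paper. The function names (`make_block_mdp`, `sample_episode`, `decoding_error`), the uniform emission within each block, the uniform behavior policy, and the choice to count errors up to a relabeling of latent states are all assumptions made for this example.

```python
import itertools
import random


def make_block_mdp(n_contexts, n_states, n_actions, seed=0):
    """Build a toy Block MDP: a decoding function f and latent transitions p."""
    rng = random.Random(seed)
    # f[x] is the latent state of context x; every latent state owns >= 1 context.
    f = [x % n_states for x in range(n_contexts)]
    rng.shuffle(f)
    # p[s][a] is a probability vector over next latent states.
    p = []
    for _ in range(n_states):
        row = []
        for _ in range(n_actions):
            weights = [rng.random() + 0.1 for _ in range(n_states)]
            total = sum(weights)
            row.append([w / total for w in weights])
        p.append(row)
    return f, p


def sample_episode(f, p, horizon, n_actions, rng):
    """Sample (context, action) pairs; the latent state is never revealed."""
    n_states = len(p)
    blocks = [[x for x, s in enumerate(f) if s == k] for k in range(n_states)]
    s = rng.randrange(n_states)
    episode = []
    for _ in range(horizon):
        x = rng.choice(blocks[s])        # assumption: uniform emission in the block
        a = rng.randrange(n_actions)     # assumption: uniform behavior policy
        episode.append((x, a))
        s = rng.choices(range(n_states), weights=p[s][a])[0]
    return episode


def decoding_error(f_true, f_hat, n_states):
    """Number of misclassified contexts, minimized over latent-state relabelings."""
    best = len(f_true)
    for perm in itertools.permutations(range(n_states)):
        errors = sum(1 for t, h in zip(f_true, f_hat) if t != perm[h])
        best = min(best, errors)
    return best
```

The brute-force minimization over permutations in `decoding_error` is only practical for a handful of latent states, which matches the "small number of latent states" regime described above. It reflects the fact that latent states can at best be recovered up to a relabeling.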