Contextual Markov decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time, with the different MDPs indexed by a context variable. While CMDPs serve as an important framework for modeling many real-world applications with time-varying environments, they remain largely unexplored from a theoretical perspective. In this paper, we study CMDPs under two linear function approximation models: Model I, with context-varying representations and common linear weights for all contexts; and Model II, with common representations for all contexts and context-varying linear weights. For both models, we propose novel model-based algorithms and show that they achieve a guaranteed $\epsilon$-suboptimality gap with polynomial sample complexity. In particular, instantiating our result for the first model to the tabular CMDP improves the existing result by removing the reachability assumption. Our result for the second model is the first known result for this type of function approximation model. A comparison between our results for the two models further indicates that, under linear CMDPs, having context-varying features leads to much better sample efficiency than having common representations for all contexts.
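To make the distinction between the two models concrete, the following is a minimal sketch of one way they could be written. The notation is an assumption for illustration and is not taken from the abstract: $w$ denotes the context, $\phi$ a $d$-dimensional feature map (the representation), and $\mu$, $\theta$ the linear weights for the transition kernel $P_w$ and the reward $r_w$.

```latex
% Illustrative notation only; the symbols w, \phi, \mu, \theta are assumptions.
\begin{align*}
\text{Model I (context-varying features, common weights):}\quad
  & P_w(s' \mid s, a) = \langle \phi_w(s, a), \mu(s') \rangle,
  && r_w(s, a) = \langle \phi_w(s, a), \theta \rangle, \\
\text{Model II (common features, context-varying weights):}\quad
  & P_w(s' \mid s, a) = \langle \phi(s, a), \mu_w(s') \rangle,
  && r_w(s, a) = \langle \phi(s, a), \theta_w \rangle.
\end{align*}
```

In this sketch, the context enters Model I only through the subscript on $\phi$, and enters Model II only through the subscripts on $\mu$ and $\theta$.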