We introduce Dynamic Contextual Markov Decision Processes (DCMDPs), a novel reinforcement learning framework for history-dependent environments that generalizes the contextual MDP framework to handle non-Markov environments, where contexts change over time. We consider special cases of the model, with a focus on logistic DCMDPs, which break the exponential dependence on history length by leveraging aggregation functions to determine context transitions. This special structure allows us to derive an upper-confidence-bound style algorithm for which we establish regret bounds. Motivated by our theoretical results, we introduce a practical model-based algorithm for logistic DCMDPs that plans in a latent space and uses optimism over history-dependent features. We demonstrate the efficacy of our approach on a recommendation task (using MovieLens data) where user behavior dynamics evolve in response to recommendations.
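To make the idea of aggregation-driven context transitions concrete, the following is a minimal sketch of one possible logistic transition model. It is an illustration under stated assumptions, not the paper's definition: the discounted-sum aggregator, the decay parameter `alpha`, the `(state, action, context)` step encoding, and the toy `one_hot_action_feature` map are all hypothetical choices made here. The point it illustrates is that the next-context distribution depends on the history only through a fixed-size aggregate, so its cost does not grow exponentially with history length.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical encoding of one history step: (state, action, context).
Step = Tuple[int, int, int]
FeatureFn = Callable[[int, int, int], List[float]]


def aggregate(history: Sequence[Step], feature: FeatureFn,
              num_contexts: int, alpha: float) -> List[float]:
    """Discounted sum of per-step feature vectors (assumed aggregator).

    The running vector z has one entry per context, so the summary of the
    history has fixed size no matter how long the history is.
    """
    z = [0.0] * num_contexts
    for state, action, context in history:
        f = feature(state, action, context)
        # Decay the old aggregate, then add the newest step's features.
        z = [alpha * zi + fi for zi, fi in zip(z, f)]
    return z


def softmax(z: Sequence[float]) -> List[float]:
    """Numerically stable softmax (the multi-class logistic link)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def next_context_probs(history: Sequence[Step], feature: FeatureFn,
                       num_contexts: int, alpha: float = 0.9) -> List[float]:
    """Distribution over the next context given the full history."""
    return softmax(aggregate(history, feature, num_contexts, alpha))


def one_hot_action_feature(num_contexts: int) -> FeatureFn:
    """Toy feature map (assumption): action a pushes toward context a."""
    def feature(state: int, action: int, context: int) -> List[float]:
        return [1.0 if i == action else 0.0 for i in range(num_contexts)]
    return feature


if __name__ == "__main__":
    feat = one_hot_action_feature(3)
    history = [(0, 1, 0), (0, 1, 0), (0, 2, 1)]
    print(next_context_probs(history, feat, num_contexts=3))
```

In this toy setup, repeatedly taking the same action shifts probability mass toward the matching context, loosely mirroring how user behavior in a recommendation setting might drift in response to past recommendations.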