We introduce Dynamic Contextual Markov Decision Processes (DCMDPs), a novel reinforcement learning framework for history-dependent environments that generalizes the contextual MDP framework to handle non-Markov environments, where contexts change over time. We consider special cases of the model, with a focus on logistic DCMDPs, which break the exponential dependence on history length by leveraging aggregation functions to determine context transitions. This special structure allows us to derive an upper-confidence-bound style algorithm for which we establish regret bounds. Motivated by our theoretical results, we introduce a practical model-based algorithm for logistic DCMDPs that plans in a latent space and uses optimism over history-dependent features. We demonstrate the efficacy of our approach on a recommendation task (using MovieLens data) where user behavior dynamics evolve in response to recommendations.
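To make the aggregation idea concrete, the following is a minimal sketch, not the paper's definition. It assumes the next context is drawn from a softmax over per-step feature vectors summed across the history. The feature map `step_features`, the choice of a plain sum as the aggregation function, and all other names are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple

# One history step: (state, action, context), all hypothetical integer ids.
Step = Tuple[int, int, int]


def step_features(step: Step, num_contexts: int) -> List[float]:
    """Hypothetical per-step feature map: one score per candidate next context."""
    state, action, context = step
    feats = [0.0] * num_contexts
    # Toy rule (an assumption): each step votes for one context and
    # adds a smaller vote for the context that was active at that step.
    feats[(state + action) % num_contexts] += 1.0
    feats[context] += 0.5
    return feats


def aggregate(history: Sequence[Step], num_contexts: int) -> List[float]:
    """Sum per-step features into a summary whose size does not grow with history length."""
    total = [0.0] * num_contexts
    for step in history:
        for i, value in enumerate(step_features(step, num_contexts)):
            total[i] += value
    return total


def next_context_probs(history: Sequence[Step], num_contexts: int) -> List[float]:
    """Softmax (multinomial logistic) distribution over the next context."""
    logits = aggregate(history, num_contexts)
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


history = [(0, 1, 0), (1, 0, 1)]
probs = next_context_probs(history, num_contexts=3)
```

In this sketch the transition model only ever sees a vector with one entry per context, so its size is independent of how many steps the history contains. That is the intuition behind avoiding an explicit enumeration of all possible histories.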