Minimum-Delay Adaptation in Non-Stationary Reinforcement Learning via Online High-Confidence Change-Point Detection

Non-stationary environments are challenging for reinforcement learning algorithms. If the state transition and/or reward functions change based on latent factors, the agent is effectively tasked with optimizing a behavior that maximizes performance over a possibly infinite random sequence of Markov Decision Processes (MDPs), each of which is drawn from some unknown distribution. We call each such MDP a context. Most related works make strong assumptions, such as knowledge of the distribution over contexts, the existence of pre-training phases, or a priori knowledge of the number, sequence, or boundaries between contexts. We introduce an algorithm that efficiently learns policies in non-stationary environments. It analyzes a possibly infinite stream of data and computes, in real time, high-confidence change-point detection statistics that reflect whether novel, specialized policies need to be created and deployed to tackle novel contexts, or whether previously optimized ones might be reused. We show that (i) this algorithm minimizes the delay until unforeseen changes to a context are detected, thereby allowing for rapid responses; and (ii) it bounds the false-alarm rate, which is important in order to minimize regret. Our method constructs a mixture model composed of a (possibly infinite) ensemble of probabilistic dynamics predictors that model the different modes of the distribution over underlying latent MDPs. We evaluate our algorithm on high-dimensional continuous reinforcement learning problems and show that it outperforms state-of-the-art (model-free and model-based) RL algorithms, as well as state-of-the-art meta-learning methods specially designed to deal with non-stationarity.
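To make the trade-off between detection delay and false alarms concrete, the following is a minimal sketch of a classical CUSUM-style detector, not the authors' implementation. It assumes each context's dynamics predictor can be summarized by a one-dimensional Gaussian `(mean, std)` pair, which is a strong simplification. The names `gaussian_logpdf`, `cusum_detect`, and `threshold` are hypothetical and chosen only for illustration.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian (assumed predictor form)."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def cusum_detect(stream, current, candidates, threshold):
    """Run one CUSUM statistic per candidate context.

    `current` and each entry of `candidates` are (mean, std) pairs.
    Returns (time step, candidate index) of the first alarm, or None
    if no statistic reaches `threshold`.
    """
    stats = [0.0] * len(candidates)
    for t, x in enumerate(stream):
        ll_current = gaussian_logpdf(x, *current)
        for k, cand in enumerate(candidates):
            # Log-likelihood ratio: candidate context vs. current context.
            llr = gaussian_logpdf(x, *cand) - ll_current
            # Accumulate evidence for a change, resetting at zero.
            stats[k] = max(0.0, stats[k] + llr)
            if stats[k] >= threshold:
                return t, k
    return None


# Toy stream: 50 observations from the current context, then a switch.
stream = [0.0] * 50 + [2.0] * 50
alarm = cusum_detect(stream, current=(0.0, 1.0),
                     candidates=[(2.0, 1.0), (-2.0, 1.0)],
                     threshold=9.0)
print(alarm)
```

In this sketch, raising `threshold` makes false alarms rarer at the cost of a longer detection delay, and lowering it does the opposite. The index returned with the alarm indicates which candidate predictor best explains the recent data, which loosely corresponds to reusing a previously optimized policy.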
