Context detection involves labeling segments of an online data stream as belonging to different tasks. Task labels are used in lifelong learning algorithms to perform consolidation or other procedures that prevent catastrophic forgetting. Inferring task labels from online experiences remains a challenging problem. Most approaches assume finite and low-dimensional observation spaces or a preliminary training phase during which task labels are learned. Moreover, changes in the transition or reward functions can be detected only in combination with a policy, and are therefore more difficult to detect than changes in the input distribution. This paper presents an approach to learning both policies and labels in an online deep reinforcement learning setting. The key idea is to use distance metrics, obtained via optimal transport methods, i.e., the Wasserstein distance, on suitable latent action-reward spaces to measure distances between sets of data points from past and current streams. Such distances can then be used in statistical tests based on an adapted Kolmogorov-Smirnov calculation to assign labels to sequences of experiences. A rollback procedure is introduced to learn multiple policies by ensuring that only the appropriate data is used to train the corresponding policy. The combination of task detection and policy deployment allows for the optimization of lifelong reinforcement learning agents without an oracle that provides task labels. The approach is tested on two benchmarks, and the results show promising performance when compared with related context detection algorithms. The results suggest that optimal transport statistical methods provide an explainable and justifiable procedure for online context detection and reward optimization in lifelong reinforcement learning.
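As a rough illustration of the core idea, the sketch below compares a reference window of samples (e.g., rewards or latent projections) against a current window using the 1-D Wasserstein distance and flags a context change when the distance exceeds a threshold. All names, the fixed threshold, and the use of raw 1-D rewards are illustrative assumptions, not the paper's exact procedure (which operates on latent action-reward spaces with an adapted Kolmogorov-Smirnov test).

```python
# Minimal sketch, assuming 1-D samples and a hand-set threshold.
# The paper's actual method uses latent action-reward spaces and an
# adapted Kolmogorov-Smirnov test; this only shows the distance idea.
import numpy as np

def wasserstein_1d(a, b):
    """W1 distance between two equal-size 1-D empirical distributions:
    the mean absolute difference of the sorted samples."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def context_changed(reference, current, threshold=0.5):
    # Flag a change when the optimal-transport distance between the
    # reference window and the current window exceeds the threshold.
    return wasserstein_1d(reference, current) > threshold

rng = np.random.default_rng(0)
task_a = rng.normal(0.0, 1.0, size=500)  # rewards under one task
task_b = rng.normal(3.0, 1.0, size=500)  # rewards after a task switch

same = context_changed(task_a, rng.normal(0.0, 1.0, size=500))
diff = context_changed(task_a, task_b)
print(same, diff)  # expect: False True
```

In the full method, a statistically calibrated test replaces the fixed threshold, so detection sensitivity adapts to the sample size rather than being hand-tuned.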