Online Reinforcement Learning (RL) is typically framed as the process of minimizing cumulative regret (CR) through interactions with an unknown environment. However, real-world RL applications usually involve a sequence of tasks, where the data collected in the first task is used to warm-start the policy for the second task; the performance of this warm-start policy is measured by simple regret (SR). Although minimizing CR and minimizing SR are generally conflicting objectives, previous research has shown that in stationary environments both can be optimized as a function of the task horizon $T$. In practice, however, human-in-the-loop decisions between tasks often introduce non-stationarity; for instance, in clinical trials, scientists may adjust target health outcomes between implementations. Our results show that task non-stationarity leads to a more restrictive trade-off between CR and SR: to balance these competing goals, the algorithm must explore excessively, incurring a CR bound worse than the typical optimal rate of $T^{1/2}$. These findings are practically significant, indicating that increased exploration is necessary in non-stationary environments to accommodate task changes, which impacts the design of RL algorithms in healthcare and other fields.
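For concreteness, the two objectives can be formalized in the standard way; the notation below ($V^{\pi}$ for the value of a policy $\pi$, $\pi_t$ for the policy deployed at round $t$, and $\hat{\pi}_T$ for the warm-start policy recommended after the task ends) is an illustrative sketch and may differ from the paper's exact definitions.

% Hedged, standard formalizations of the two regret notions over a horizon T.
\begin{align}
  \mathrm{CR}(T) &= \sum_{t=1}^{T} \bigl( V^{\pi^*} - V^{\pi_t} \bigr),
  && \text{(cumulative regret incurred during the task)} \\
  \mathrm{SR}(T) &= V^{\pi^*} - V^{\hat{\pi}_T},
  && \text{(simple regret of the warm-start policy)}
\end{align}

Under these definitions, the trade-off discussed above is between keeping $\mathrm{CR}(T)$ small while the task runs and producing a recommendation $\hat{\pi}_T$ with small $\mathrm{SR}(T)$ for the next task.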