Continual learning in reinforcement learning, often called non-stationary reinforcement learning, has been identified as an important challenge to the application of reinforcement learning. We prove a worst-case complexity result that we believe captures this challenge: modifying the transition probabilities or the reward of a single state-action pair in a reinforcement learning problem requires time almost as large as the number of states to keep the value function up to date, unless the strong exponential time hypothesis (SETH) is false. SETH is a widely accepted strengthening of the P $\neq$ NP conjecture. Recall that the number of states in current applications of reinforcement learning is typically astronomical. In contrast, we show that just $\textit{adding}$ a new state-action pair is considerably easier to implement.
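To give some intuition for why a single modification can be expensive, the following minimal sketch builds a toy chain MDP and recomputes the value function from scratch after changing the reward of one state-action pair. It is an illustration only, not the construction or reduction behind the hardness result: the helper names `value_iteration` and `chain_mdp`, the discount factor, and the chain layout are all assumptions made for this example.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-10):
    """Plain in-place value iteration.

    transitions[s] is a list of actions; each action is a pair
    (reward, [(probability, next_state), ...]).
    """
    values = [0.0] * len(transitions)
    while True:
        delta = 0.0
        for s, actions in enumerate(transitions):
            best = max(
                reward + gamma * sum(p * values[t] for p, t in outcomes)
                for reward, outcomes in actions
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


def chain_mdp(n, final_reward):
    """Hypothetical toy MDP: state i moves to i + 1 with reward 0;
    the last state loops on itself and collects final_reward."""
    transitions = [[(0.0, [(1.0, i + 1)])] for i in range(n - 1)]
    transitions.append([(final_reward, [(1.0, n - 1)])])
    return transitions


n = 50
before = value_iteration(chain_mdp(n, final_reward=1.0))
# Modify the reward of exactly one state-action pair (the final self-loop).
after = value_iteration(chain_mdp(n, final_reward=2.0))

# Count how many states ended up with a different value.
changed = sum(1 for b, a in zip(before, after) if abs(b - a) > 1e-6)
print(changed)
```

In this toy chain every state's value depends on the modified reward, so all `n` entries of the value function change. The sketch only shows that a naive update may have to touch every state; the result stated above is stronger, since it rules out (under SETH) any update procedure that is substantially faster than the number of states in the worst case.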