Real-world reinforcement learning (RL) is often severely limited because typical RL algorithms rely heavily on a reset mechanism to sample proper initial states. In practice, resets are expensive to implement because they require human intervention or heavily engineered environments. To make learning more practical, we propose a generic no-regret reduction for systematically designing reset-free RL algorithms. Our reduction turns reset-free RL into a two-player game. We show that achieving sublinear regret in this two-player game implies learning a policy that has both sublinear performance regret and a sublinear total number of resets in the original RL problem. This means that the agent eventually learns to perform optimally and to avoid resets. Using this reduction, we design an instantiation for linear Markov decision processes, which is, to our knowledge, the first provably correct reset-free RL algorithm.
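To make the step from "sublinear" to "eventually optimal and reset-free" concrete, the following sketch spells out the standard reading of the two quantities. The notation is an assumption introduced here for illustration and is not taken from the text above: $K$ is the number of episodes, $\pi_k$ the policy played in episode $k$, $s_1^k$ its initial state, $V^{\pi}$ the value function of policy $\pi$, $\pi^\ast$ a comparator policy, and $r_k \in \{0, 1\}$ an indicator of whether episode $k$ required a reset.

```latex
% Illustrative notation only (assumed, not from the abstract).
\begin{align*}
  \mathrm{Regret}(K) &= \sum_{k=1}^{K} \Bigl( V^{\pi^\ast}(s_1^k) - V^{\pi_k}(s_1^k) \Bigr),
  &
  \mathrm{Resets}(K) &= \sum_{k=1}^{K} r_k .
\end{align*}
% "Sublinear" means both quantities grow strictly slower than K:
\begin{align*}
  \mathrm{Regret}(K) = o(K), \quad \mathrm{Resets}(K) = o(K)
  \;\;\Longrightarrow\;\;
  \frac{\mathrm{Regret}(K)}{K} \to 0
  \quad\text{and}\quad
  \frac{\mathrm{Resets}(K)}{K} \to 0
  \quad\text{as } K \to \infty .
\end{align*}
% Example with a hypothetical rate: if both are at most C\sqrt{K} for some
% constant C > 0, the per-episode averages are at most C/\sqrt{K}.
```

Under this reading, the average per-episode suboptimality and the fraction of episodes that need a reset both vanish as $K$ grows, which is the sense in which the agent eventually performs optimally and avoids resets. The $C\sqrt{K}$ rate is only an example of a sublinear rate, not a bound claimed by the text above.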