Reinforcement learning (RL) has so far seen limited real-world application. One key challenge is that typical RL algorithms rely heavily on a reset mechanism to sample proper initial states; in practice, these reset mechanisms are expensive to implement because they require human intervention or heavily engineered environments. To make learning more practical, we propose a generic no-regret reduction for systematically designing reset-free RL algorithms. Our reduction turns the reset-free RL problem into a two-player game. We show that achieving sublinear regret in this two-player game implies learning a policy that has both sublinear performance regret and a sublinear total number of resets in the original RL problem. This means that the agent eventually learns to perform optimally and to avoid resets. To demonstrate the effectiveness of this reduction, we design an instantiation for linear Markov decision processes, which is the first provably correct reset-free RL algorithm.
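To make "sublinear" concrete, the following is a minimal sketch in notation assumed purely for illustration; none of these symbols appear in the abstract, and the paper's own definitions may differ. Let $K$ be the number of episodes, $\pi_k$ the policy played in episode $k$, $s_k$ its initial state, $V^{\pi}$ the value of policy $\pi$, $V^{*}$ the optimal value, and $N_{\mathrm{reset}}(K)$ the total number of resets over $K$ episodes.

```latex
% Assumed notation (hypothetical, for illustration only):
% performance regret accumulated over K episodes
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Bigl( V^{*}(s_k) - V^{\pi_k}(s_k) \Bigr)

% "Sublinear" means both quantities grow strictly slower than K:
\mathrm{Regret}(K) = o(K), \qquad N_{\mathrm{reset}}(K) = o(K)

% Equivalently, the per-episode averages vanish:
\lim_{K \to \infty} \frac{\mathrm{Regret}(K)}{K} = 0,
\qquad
\lim_{K \to \infty} \frac{N_{\mathrm{reset}}(K)}{K} = 0
```

Under these assumed definitions, vanishing averages are what the statement "the agent eventually learns to perform optimally and avoid resets" amounts to: the average per-episode performance gap and the fraction of episodes that need a reset both tend to zero.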