Obtaining no-regret guarantees for reinforcement learning (RL) in problems with continuous state and/or action spaces remains one of the major open challenges in the field. Recently, a variety of solutions have been proposed, but outside of very specific settings, the general problem remains unsolved. In this paper, we introduce a novel structural assumption on Markov decision processes (MDPs), namely $\nu$-smoothness, that generalizes most of the settings proposed so far (e.g., linear MDPs and Lipschitz MDPs). To address this challenging scenario, we propose two algorithms for regret minimization in $\nu$-smooth MDPs. Both algorithms build upon the idea of constructing an MDP representation through an orthogonal feature map based on Legendre polynomials. The first algorithm, \textsc{Legendre-Eleanor}, achieves the no-regret property under weaker assumptions but is computationally inefficient, whereas the second one, \textsc{Legendre-LSVI}, runs in polynomial time, although for a smaller class of problems. After analyzing their regret properties, we compare our results with state-of-the-art ones from RL theory, showing that our algorithms achieve the best guarantees.
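To make the idea of an orthogonal feature map based on Legendre polynomials concrete, the following is a minimal sketch in Python. It is not the construction used by \textsc{Legendre-Eleanor} or \textsc{Legendre-LSVI}, whose details are not given here. It assumes state-action pairs rescaled to $[-1, 1]^d$ and a tensor-product basis of fixed degree. The function names (`legendre_features`, `tensor_legendre_features`, `demo_fit`) and the target function are hypothetical and chosen only for illustration.

```python
import numpy as np
from numpy.polynomial import legendre


def legendre_features(x, degree):
    """Map points x in [-1, 1] to the features P_0(x), ..., P_degree(x)."""
    x = np.asarray(x, dtype=float)
    # legvander returns an array of shape x.shape + (degree + 1,)
    return legendre.legvander(x, degree)


def tensor_legendre_features(z, degree):
    """Tensor-product Legendre features for points z in [-1, 1]^d.

    z has shape (n, d); the result has shape (n, (degree + 1) ** d).
    Assumption: a full tensor-product basis, used here for illustration only.
    """
    z = np.atleast_2d(np.asarray(z, dtype=float))
    n = z.shape[0]
    feats = np.ones((n, 1))
    for j in range(z.shape[1]):
        one_dim = legendre.legvander(z[:, j], degree)  # shape (n, degree + 1)
        # Outer product per sample, flattened into one feature axis.
        feats = (feats[:, :, None] * one_dim[:, None, :]).reshape(n, -1)
    return feats


def demo_fit(degree=8, n_train=2000, n_test=500, seed=0):
    """Fit a smooth (hypothetical) function of (state, action) by least squares
    on Legendre features and return the maximum absolute test error."""
    rng = np.random.default_rng(seed)

    def target(z):
        return np.exp(-z[:, 0] ** 2) * np.cos(2.0 * z[:, 1])

    z_train = rng.uniform(-1.0, 1.0, size=(n_train, 2))
    z_test = rng.uniform(-1.0, 1.0, size=(n_test, 2))
    phi_train = tensor_legendre_features(z_train, degree)
    theta, *_ = np.linalg.lstsq(phi_train, target(z_train), rcond=None)
    pred = tensor_legendre_features(z_test, degree) @ theta
    return float(np.max(np.abs(pred - target(z_test))))
```

Calling `demo_fit()` returns the worst-case error of the linear-in-features fit on held-out points. For a smooth target this error shrinks quickly as `degree` grows, which is the intuition behind representing a smooth MDP through such a feature map. The number of features, `(degree + 1) ** d`, grows exponentially with the dimension `d`, so this sketch is only practical for small `d`.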