We study reward-free reinforcement learning (RL) with linear function approximation, where the agent works in two phases: (1) in the exploration phase, the agent interacts with the environment but cannot access the reward; and (2) in the planning phase, the agent is given a reward function and is expected to find a near-optimal policy based on the samples collected in the exploration phase. The sample complexities of existing reward-free algorithms depend polynomially on the planning horizon, which makes them impractical for RL problems with long planning horizons. In this paper, we propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs), where the transition probability can be parameterized as a linear combination of known feature mappings. At the core of our algorithm is uncertainty-weighted value-targeted regression with an exploration-driven pseudo-reward and a high-order moment estimator for the aleatoric and epistemic uncertainties. When the total reward is bounded by $1$, we show that our algorithm needs to explore for only $\tilde O(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy, where $d$ is the dimension of the feature mapping. The sample complexity of our algorithm has only a polylogarithmic dependence on the planning horizon and is therefore ``horizon-free''. In addition, we provide an $\Omega(d^2\varepsilon^{-2})$ sample complexity lower bound, which matches the sample complexity of our algorithm up to logarithmic factors, suggesting that our algorithm is nearly optimal.
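To make the linear mixture structure concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a hypothetical toy instance with 3 states, 2 actions, and $d = 2$ known base kernels, so that $P(s' \mid s, a) = \sum_i \theta_i P_i(s' \mid s, a)$, and it estimates $\theta$ by a weighted ridge regression of $V(s')$ on the value-targeted features. All names (`base_kernels`, `value_feature`, `weighted_ridge_2d`, `collect`) are illustrative. The weights use the true conditional variance of $V(s')$, which a real learner would have to estimate. The exploration-driven pseudo-reward, the high-order moment estimator, and the confidence sets are omitted.

```python
import random

# Hypothetical toy instance (an assumption for illustration only).
N_STATES, N_ACTIONS = 3, 2


def base_kernels(s, a):
    """Return the d = 2 known base distributions over next states at (s, a)."""
    deterministic = [0.0] * N_STATES
    deterministic[(s + a) % N_STATES] = 1.0  # a = 0 stays put, a = 1 moves to s + 1
    uniform = [1.0 / N_STATES] * N_STATES
    return [deterministic, uniform]


def transition(s, a, theta):
    """Linear mixture kernel: P(s'|s,a) = sum_i theta[i] * P_i(s'|s,a)."""
    kernels = base_kernels(s, a)
    return [sum(t * k[s2] for t, k in zip(theta, kernels)) for s2 in range(N_STATES)]


def value_feature(s, a, V):
    """phi_V(s,a)[i] = sum_{s'} P_i(s'|s,a) V(s'), so E[V(s')] = <phi_V(s,a), theta>."""
    return [sum(p * v for p, v in zip(k, V)) for k in base_kernels(s, a)]


def weighted_ridge_2d(xs, ys, weights, lam=1e-9):
    """Minimize sum_k w_k (<x_k, theta> - y_k)^2 + lam * ||theta||^2 for d = 2."""
    a11 = a22 = lam
    a12 = b1 = b2 = 0.0
    for (x1, x2), y, w in zip(xs, ys, weights):
        a11 += w * x1 * x1
        a12 += w * x1 * x2
        a22 += w * x2 * x2
        b1 += w * x1 * y
        b2 += w * x2 * y
    det = a11 * a22 - a12 * a12  # solve the 2x2 normal equations by Cramer's rule
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


def collect(theta, V, n_per_pair, rng):
    """Sample next states at every (s, a) and build the regression data."""
    xs, ys, ws = [], [], []
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            probs = transition(s, a, theta)
            mean = sum(p * v for p, v in zip(probs, V))
            var = sum(p * (v - mean) ** 2 for p, v in zip(probs, V))
            for _ in range(n_per_pair):
                s_next = rng.choices(range(N_STATES), weights=probs)[0]
                xs.append(value_feature(s, a, V))
                ys.append(V[s_next])
                # Assumption: inverse true variance (floored) as the weight.
                ws.append(1.0 / max(var, 1e-2))
    return xs, ys, ws


if __name__ == "__main__":
    theta_true = [0.3, 0.7]
    V = [0.0, 0.5, 1.0]
    xs, ys, ws = collect(theta_true, V, 2000, random.Random(0))
    print(weighted_ridge_2d(xs, ys, ws))  # an estimate of theta_true
```

The regression target is the scalar $V(s')$ rather than the next state itself, so only a $d$-dimensional parameter is estimated regardless of the number of states. Down-weighting high-variance samples is the basic idea behind the uncertainty weighting named in the abstract.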