We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the \emph{reward-free} exploration setting. This problem is well motivated because deploying new policies is costly in real-life RL applications. Under the linear MDP setting with feature dimension $d$ and planning horizon $H$, we propose a new algorithm that collects at most $\widetilde{O}(\frac{d^2H^5}{\epsilon^2})$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions. To the best of our knowledge, our approach is the first to achieve optimal deployment complexity and optimal $d$ dependence in sample complexity at the same time, even when the reward is known ahead of time. Our novel techniques include an exploration-preserving policy discretization and a generalized G-optimal experiment design, which could be of independent interest. Lastly, we analyze the related problem of regret minimization in low-adaptive RL and provide information-theoretic lower bounds for switching cost and batch complexity.
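For readers unfamiliar with G-optimal experiment design, the following is a minimal sketch of the classical (not the generalized) problem: given a finite set of feature vectors $x_i \in \mathbb{R}^d$, find sampling weights $w$ that minimize the worst-case variance $\max_i x_i^\top (\sum_j w_j x_j x_j^\top)^{-1} x_i$. It uses the standard Frank-Wolfe (Fedorov-Wynn) iteration and is not the paper's algorithm. The function names `g_optimal_design` and `worst_case_variance`, the tolerance, and the iteration cap are illustrative assumptions, and the features are assumed to span $\mathbb{R}^d$.

```python
import numpy as np


def worst_case_variance(features, weights):
    """Return max_i x_i^T V(w)^{-1} x_i, where V(w) = sum_i w_i x_i x_i^T."""
    features = np.asarray(features, dtype=float)
    cov = features.T @ (weights[:, None] * features)
    cov_inv = np.linalg.pinv(cov)
    leverage = np.einsum("ij,jk,ik->i", features, cov_inv, features)
    return float(leverage.max())


def g_optimal_design(features, tol=0.01, max_iters=10_000):
    """Approximate a classical G-optimal design with Frank-Wolfe steps.

    Assumes the rows of `features` span R^d (illustrative sketch only).
    Stops once the worst-case variance is within a (1 + tol) factor of d,
    the optimal value given by the Kiefer-Wolfowitz theorem.
    """
    features = np.asarray(features, dtype=float)
    n, d = features.shape
    weights = np.full(n, 1.0 / n)  # start from the uniform design
    for _ in range(max_iters):
        cov = features.T @ (weights[:, None] * features)
        cov_inv = np.linalg.pinv(cov)
        # leverage[i] = x_i^T V(w)^{-1} x_i
        leverage = np.einsum("ij,jk,ik->i", features, cov_inv, features)
        worst = int(np.argmax(leverage))
        g_max = leverage[worst]
        if g_max <= d * (1.0 + tol):
            break
        # Closed-form line-search step toward the worst-covered point.
        step = (g_max / d - 1.0) / (g_max - 1.0)
        weights *= 1.0 - step
        weights[worst] += step
    return weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))  # synthetic features, for illustration only
    w = g_optimal_design(X)
    print("worst-case variance:", worst_case_variance(X, w))
```

By the Kiefer-Wolfowitz theorem, the optimal worst-case variance equals $d$, so the stopping rule checks how close the current design is to that value. In a linear setting, a design of this kind controls the prediction error uniformly over all features, which is one reason experiment design is a natural tool for exploration with few deployments.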