Most existing work on reinforcement learning (RL) with general function approximation (FA) focuses on understanding statistical complexity or regret bounds. However, the computational complexity of such approaches is far from understood: indeed, even a simple optimization problem over the function class may be intractable. In this paper, we tackle this problem by establishing an efficient online sub-sampling framework that measures the information gain of the data points collected by an RL algorithm and uses this measurement to guide exploration. For a value-based method with a complexity-bounded function class, we show that the policy only needs to be updated $\propto\operatorname{poly}\log(K)$ times when running the RL algorithm for $K$ episodes, while still achieving a small, near-optimal regret bound. In contrast to existing approaches, which update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls needed to solve for a policy. When applied to the settings in \cite{wang2020reinforcement} or \cite{jin2021bellman}, we improve the overall time complexity by at least a factor of $K$. Finally, we show the generality of our online sub-sampling technique by applying it to the reward-free RL setting and the multi-agent RL setting.
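To make the low-switching idea concrete, the following is a minimal sketch for a linear function class only; it is not the paper's algorithm. It assumes unit-norm feature vectors, uses the quadratic form $x^\top A^{-1} x$ as a stand-in for information gain, and replaces randomized sub-sampling with a deterministic threshold. The names `count_policy_updates`, `update_bound`, and `threshold` are hypothetical and chosen for illustration. A point joins the sub-sampled set only when its information gain is large, and the policy would be re-solved only at those moments.

```python
import math
import random


def quad_form(M, x):
    """Return x^T M x for a square matrix M stored as a list of rows."""
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))


def sherman_morrison(A_inv, x):
    """Return (A + x x^T)^{-1} given the symmetric inverse A^{-1}."""
    n = len(x)
    Ax = [sum(A_inv[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(x[i] * Ax[i] for i in range(n))
    return [[A_inv[i][j] - Ax[i] * Ax[j] / denom for j in range(n)]
            for i in range(n)]


def count_policy_updates(num_episodes, dim=3, threshold=0.5, seed=0):
    """Count how often the sub-sampled set (and hence the policy) changes.

    Assumption: one unit-norm feature vector per episode, drawn at random;
    a real RL algorithm would obtain features from its own trajectories.
    """
    rng = random.Random(seed)
    # A = I (regularizer) + sum of x x^T over the sub-sampled set.
    A_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    updates = 0
    for _ in range(num_episodes):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in g))
        x = [v / norm for v in g]
        # Information-gain proxy relative to the current sub-sampled set.
        if quad_form(A_inv, x) >= threshold:
            A_inv = sherman_morrison(A_inv, x)  # add x to the sub-sampled set
            updates += 1                        # policy would be re-solved here
    return updates


def update_bound(num_episodes, dim=3, threshold=0.5):
    """Upper bound on the update count from a log-determinant argument."""
    return dim * math.log(1.0 + num_episodes / dim) / math.log(1.0 + threshold)
```

In this simplified setting each accepted point raises $\log\det A$ by at least $\log(1+\text{threshold})$, while $\log\det A \le d\log(1+K/d)$, so the number of policy updates grows only logarithmically in $K$. For example, `count_policy_updates(2000)` can be compared against `update_bound(2000)` and against the 2000 updates a per-episode scheme would perform.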