The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB), for RL with general function approximation. Our key algorithmic designs include (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves a minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and a near-optimal policy switching cost of $\tilde{O}(dH)$, where $d$ is the eluder dimension of the function class, $H$ is the planning horizon, and $K$ is the number of episodes. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.