In many real-world reinforcement learning (RL) problems, deploying a new policy is costly. In such settings, an algorithm must explore (which requires adaptivity) while switching the deployed policy only rarely (which limits adaptivity). In this paper, we go beyond the existing state of the art on this problem, which focuses on linear Markov Decision Processes (MDPs), by considering linear Bellman-complete MDPs with low inherent Bellman error. We propose the ELEANOR-LowSwitching algorithm, which achieves near-optimal regret with a switching cost that is logarithmic in the number of episodes and linear in the time horizon $H$ and the feature dimension $d$. We also prove a switching-cost lower bound proportional to $dH$ that holds for all algorithms with sublinear regret. In addition, we show that the ``doubling trick'' used in ELEANOR-LowSwitching can be further leveraged for generalized linear function approximation, for which we design a sample-efficient algorithm with near-optimal switching cost.
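To illustrate why a doubling trick can give a switching cost logarithmic in the number of episodes, here is a minimal sketch of a generic determinant-doubling rule for a single feature covariance matrix. It is not the ELEANOR-LowSwitching algorithm: the abstract does not state the exact switching criterion, so the rule, the function name `count_switches`, the regularizer `lam`, the threshold `factor`, and the random unit-norm features are all assumptions made for illustration.

```python
import numpy as np


def count_switches(features, lam=1.0, factor=2.0):
    """Count policy switches under an assumed determinant-doubling rule.

    The policy is "switched" only when the determinant of the regularized
    Gram matrix has grown by more than `factor` since the last switch.
    """
    d = features.shape[1]
    gram = lam * np.eye(d)
    # Log-determinant recorded at the most recent switch (hypothetical bookkeeping).
    last_logdet = np.linalg.slogdet(gram)[1]
    switches = 0
    for phi in features:
        gram += np.outer(phi, phi)  # rank-one update with the new feature
        logdet = np.linalg.slogdet(gram)[1]
        if logdet > last_logdet + np.log(factor):
            switches += 1
            last_logdet = logdet
    return switches


# Illustrative data: K episodes, each contributing one unit-norm feature in R^d.
rng = np.random.default_rng(0)
K, d, lam = 5000, 4, 1.0
features = rng.normal(size=(K, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

n_switches = count_switches(features, lam=lam)
# With unit-norm features, det(gram) <= (lam + K/d)^d, so the number of
# doublings is at most d * log2(1 + K / (d * lam)).
bound = d * np.log2(1 + K / (d * lam))
print(f"switches: {n_switches}, bound: {bound:.1f}")
```

Under these assumptions the number of switches grows like $d \log K$ for one covariance matrix. Applying such a rule separately at each of the $H$ steps of an episode would give on the order of $dH \log K$ switches, which matches the scaling stated in the abstract.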