Switching costs, which capture the cost of changing policies, are regarded as a critical metric in reinforcement learning (RL), in addition to the standard metric of losses (or rewards). However, existing studies on switching costs (with a coefficient $\beta$ that is strictly positive and independent of $T$) have mainly focused on static RL, where the loss distribution is assumed to be fixed during the learning process, and thus do not cover practical scenarios in which the loss distribution is non-stationary or even adversarial. While adversarial RL models such scenarios better, an open problem remains: how can one develop a provably efficient algorithm for adversarial RL with switching costs? This paper makes the first effort towards solving this problem. First, we provide a regret lower bound showing that the regret of any algorithm is at least $\tilde{\Omega}( ( H S A )^{1/3} T^{2/3} )$, where $T$, $S$, $A$ and $H$ are the numbers of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best regret achieved in static RL with switching costs (as well as in adversarial RL without switching costs), whose dependence on $T$ is $\tilde{O}(\sqrt{T})$, is no longer achievable. Moreover, we propose two novel switching-reduced algorithms whose regrets match our lower bound when the transition function is known, and match it within a small factor of $\tilde{O}( H^{1/3} )$ when the transition function is unknown. Our regret analysis demonstrates that both algorithms are near-optimal.
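To make the quantity being bounded concrete, the following is a minimal Python sketch of how regret with switching costs could be computed for one run. It assumes the common convention that a cost of $\beta$ is charged each time the policy differs from the previous episode's policy, and that the comparator is the best fixed policy in hindsight, which never switches. The abstract does not spell out this definition, so the function names (`count_switches`, `cost_with_switching`, `regret_with_switching`) and the exact form are illustrative assumptions, not the paper's formulation.

```python
from typing import Hashable, Sequence


def count_switches(policies: Sequence[Hashable]) -> int:
    """Count the episodes whose policy differs from the previous episode's."""
    return sum(1 for prev, cur in zip(policies, policies[1:]) if prev != cur)


def cost_with_switching(
    losses: Sequence[float], policies: Sequence[Hashable], beta: float
) -> float:
    """Total loss over all episodes plus beta per policy switch (assumed convention)."""
    if len(losses) != len(policies):
        raise ValueError("need exactly one loss and one policy per episode")
    return sum(losses) + beta * count_switches(policies)


def regret_with_switching(
    losses: Sequence[float],
    policies: Sequence[Hashable],
    beta: float,
    best_fixed_loss: float,
) -> float:
    """Learner's cost minus the total loss of the best fixed policy in hindsight.

    The fixed comparator never switches, so it pays no switching cost.
    """
    return cost_with_switching(losses, policies, beta) - best_fixed_loss


if __name__ == "__main__":
    # Hypothetical 5-episode run; policies are just labels here.
    policies = ["a", "a", "b", "b", "a"]
    losses = [1.0, 0.5, 0.5, 1.0, 0.0]
    print(count_switches(policies))
    print(regret_with_switching(losses, policies, beta=0.5, best_fixed_loss=2.0))
```

Under this convention, an algorithm that changes its policy in every episode pays switching costs that grow linearly in $T$, which is why switching-reduced algorithms must trade off how often they update against how quickly they adapt to the losses.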