Bilevel reinforcement learning (RL), which features two intertwined optimization levels, has attracted growing interest recently. The inherent non-convexity of the lower-level RL problem has, however, been an impediment to developing bilevel optimization methods. By employing the fixed-point equation associated with regularized RL, we characterize the hyper-gradient via fully first-order information, thus circumventing the assumption of lower-level convexity. Remarkably, this distinguishes our development of the hyper-gradient from general AID-based bilevel frameworks, since we exploit the specific structure of RL problems. Moreover, facilitated by access to the fully first-order hyper-gradient, we propose both model-based and model-free bilevel reinforcement learning algorithms. Both algorithms provably enjoy a convergence rate of $\mathcal{O}(\epsilon^{-1})$. To the best of our knowledge, this is the first AID-based bilevel RL method that dispenses with additional assumptions on the lower-level problem. In addition, numerical experiments demonstrate that the hyper-gradient indeed serves as an integration of exploitation and exploration.
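To make the fixed-point idea concrete: in entropy-regularized RL the lower-level optimal policy satisfies a smooth softmax fixed-point equation, $\pi^*(a\mid s) \propto \exp(Q^*(s,a)/\tau)$, so the lower-level solution map is differentiable in upper-level parameters without any convexity assumption on the RL objective. The following is a minimal tabular sketch of this standard regularized fixed point (our own illustration, not the paper's algorithm; `soft_value_iteration` and the toy MDP are assumptions introduced here):

```python
import numpy as np

def soft_value_iteration(P, r, gamma=0.9, tau=0.1, iters=500):
    """Soft (entropy-regularized) value iteration on a tabular MDP.

    P: (S, A, S) transition tensor, r: (S, A) reward matrix.
    Returns soft Q-values, soft state values, and the softmax policy
    that solves the regularized fixed-point equation.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * (P @ V)                          # (S, A) soft Bellman backup
        V = tau * np.log(np.exp(Q / tau).sum(axis=1))    # log-sum-exp replaces the max
    pi = np.exp((Q - V[:, None]) / tau)                  # softmax policy; rows sum to 1
    return Q, V, pi

# Usage on a tiny 2-state, 2-action deterministic MDP (hypothetical example).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 0] = P[1, 1, 1] = 1.0
r = np.array([[1.0, 0.0], [0.0, 1.0]])
Q, V, pi = soft_value_iteration(P, r)
print(np.allclose(pi.sum(axis=1), 1.0))
```

Because the log-sum-exp backup is smooth, differentiating this fixed point (e.g. via the implicit function theorem) requires only first-order information, which is the structural property the abstract leverages.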