Bilevel optimization has recently been applied to many machine learning tasks. However, its applications have been restricted to the supervised learning setting, where static objective functions with benign structures are considered. Bilevel problems such as incentive design, inverse reinforcement learning (RL), and RL from human feedback (RLHF) are often modeled with dynamic objective functions that go beyond simple static structures, which poses significant challenges for existing bilevel solutions. To tackle this new class of bilevel problems, we introduce the first principled algorithmic framework for solving bilevel RL problems through the lens of penalty formulation. We provide theoretical studies of the problem landscape and of penalty-based (policy) gradient algorithms. We demonstrate the effectiveness of our algorithms via simulations on Stackelberg Markov games, RL from human feedback, and incentive design.