Bilevel optimization has recently been applied to many machine learning tasks. However, its applications have been restricted to the supervised learning setting, where static objective functions with benign structures are considered. Bilevel problems such as incentive design, inverse reinforcement learning (RL), and RL from human feedback (RLHF), by contrast, are often modeled with dynamic objective functions that go beyond these simple static structures, which poses significant challenges for applying existing bilevel solutions. To tackle this new class of bilevel problems, we introduce the first principled algorithmic framework for solving bilevel RL problems through the lens of penalty formulation. We provide theoretical studies of the problem landscape and of the associated penalty-based (policy) gradient algorithms. We demonstrate the effectiveness of our algorithms via simulations in the Stackelberg Markov game, RL from human feedback, and incentive design.