Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), where a leader agent optimizes its objective while a follower solves a Markov decision process (MDP) conditioned on the leader's decisions. A fundamental challenge arises when the leader cannot intervene in the follower's optimization process and can only observe its outcome. We address this decentralized setting by deriving the hypergradient of the leader's objective, i.e., the gradient with respect to the leader's strategy that accounts for changes in the follower's optimal policy. Unlike prior hypergradient-based methods, which either require extensive data for repeated state visits or rely on gradient estimators whose complexity can grow substantially with the dimensionality of the leader's decision space, we leverage the Boltzmann covariance trick to derive an alternative hypergradient formulation. This enables efficient hypergradient estimation solely from interaction samples, even when the leader's decision space is high-dimensional. Additionally, to our knowledge, this is the first method that enables hypergradient-based optimization for two-player Markov games in decentralized settings. Experiments highlight the impact of hypergradient updates and demonstrate our method's effectiveness in both discrete-state and continuous-state tasks.
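As a generic illustration of the kind of identity that the name "Boltzmann covariance trick" suggests, the sketch below checks numerically that, for a Boltzmann distribution p_θ(x) ∝ exp(θ · φ(x)) over a finite set, the gradient of E[g(x)] with respect to θ equals the covariance Cov(g(x), φ(x)), which can be estimated from samples alone. This is not the paper's derivation or algorithm: the linear energy, the finite state set, and all function names (`boltzmann`, `covariance_gradient`, and so on) are assumptions made purely for illustration.

```python
import numpy as np


def boltzmann(theta, features):
    """Return p(x) proportional to exp(theta . features[x]) over a finite set."""
    logits = features @ theta
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def expected_value(theta, features, g):
    """Exact E_p[g(x)] under the Boltzmann distribution."""
    return boltzmann(theta, features) @ g


def covariance_gradient(theta, features, g):
    """Exact Cov_p(g(x), features[x]), which equals d/dtheta E_p[g(x)]."""
    p = boltzmann(theta, features)
    centered_g = g - p @ g
    centered_phi = features - p @ features
    return (p * centered_g) @ centered_phi


def sampled_covariance_gradient(theta, features, g, n_samples, rng):
    """Estimate the same covariance using only samples drawn from p."""
    p = boltzmann(theta, features)
    idx = rng.choice(len(g), size=n_samples, p=p)
    g_s, phi_s = g[idx], features[idx]
    centered_g = g_s - g_s.mean()
    centered_phi = phi_s - phi_s.mean(axis=0)
    return (centered_g[:, None] * centered_phi).mean(axis=0)


def finite_difference_gradient(theta, features, g, eps=1e-6):
    """Central finite differences of E_p[g(x)], used only as a reference."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        upper = expected_value(theta + step, features, g)
        lower = expected_value(theta - step, features, g)
        grad[i] = (upper - lower) / (2 * eps)
    return grad
```

The point of the sketch is that the sample-based estimator needs only draws of x together with g(x) and φ(x), and its cost does not require a separate perturbation per coordinate of θ, unlike the finite-difference reference.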