Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), in which a leader agent optimizes its own objective while a follower solves a Markov decision process (MDP) conditioned on the leader's decisions. In many such problems the leader cannot intervene in the follower's optimization process and can only observe its outcome, which poses a fundamental challenge. We address this decentralized setting by deriving the hypergradient of the leader's objective, i.e., the gradient with respect to the leader's strategy that accounts for changes in the follower's optimal policy. Prior hypergradient-based methods either require extensive data from repeated state visits or rely on gradient estimators whose complexity can grow substantially with the dimensionality of the leader's decision space. In contrast, we leverage the Boltzmann covariance trick to derive an alternative hypergradient formulation that enables efficient hypergradient estimation solely from interaction samples, even when the leader's decision space is high-dimensional. To our knowledge, this is also the first method that enables hypergradient-based optimization for two-player Markov games in decentralized settings. Experiments highlight the impact of hypergradient updates and demonstrate the method's effectiveness on both discrete and continuous state tasks.
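The abstract does not state the formula behind the Boltzmann covariance trick. As a rough illustration only, the sketch below assumes the name refers to the standard identity that the derivative of an expectation under a Boltzmann (softmax) distribution is a covariance: if pi_theta(a) is proportional to exp(beta * phi(a) . theta), then d/dtheta E[f(a)] = beta * Cov(f(a), phi(a)). This is a one-step toy with no MDP, not the paper's estimator; the names `features`, `f`, `theta`, and `beta` are hypothetical stand-ins for the follower's action features, the leader's payoff, the leader's decision, and the inverse temperature.

```python
import numpy as np


def boltzmann(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()


def expected_value(theta, features, f, beta):
    """E_{a ~ pi_theta}[f(a)] with pi_theta(a) proportional to exp(beta * features[a] @ theta)."""
    p = boltzmann(beta * features @ theta)
    return p @ f


def covariance_gradient(theta, features, f, beta):
    """Exact gradient of the expectation, written as beta * Cov(f, features)."""
    p = boltzmann(beta * features @ theta)
    mean_f = p @ f
    mean_phi = p @ features
    return beta * ((p * f) @ features - mean_f * mean_phi)


def sampled_covariance_gradient(theta, features, f, beta, n_samples, rng):
    """Same quantity estimated from sampled actions only (empirical covariance)."""
    p = boltzmann(beta * features @ theta)
    idx = rng.choice(len(p), size=n_samples, p=p)
    fs = f[idx]
    phis = features[idx]
    centered = (fs - fs.mean())[:, None] * (phis - phis.mean(axis=0))
    return beta * centered.mean(axis=0)


def finite_difference_gradient(theta, features, f, beta, eps=1e-6):
    """Central finite differences, used only as a reference check."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        upper = expected_value(theta + step, features, f, beta)
        lower = expected_value(theta - step, features, f, beta)
        grad[i] = (upper - lower) / (2 * eps)
    return grad


# Hypothetical toy problem: 6 follower actions, 3-dimensional leader decision.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 3))
f = rng.normal(size=6)
theta = rng.normal(size=3)
beta = 1.0

print("covariance form  :", covariance_gradient(theta, features, f, beta))
print("finite difference:", finite_difference_gradient(theta, features, f, beta))
print("sample estimate  :", sampled_covariance_gradient(theta, features, f, beta, 200_000, rng))
```

In this toy, the sample-based estimator needs only draws from the Boltzmann distribution, and its cost per sample is linear in the dimension of `theta`, whereas the finite-difference reference needs two evaluations per dimension. This is loosely the kind of property the abstract attributes to the proposed formulation.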