Reward shaping in multi-agent reinforcement learning (MARL) for complex tasks remains a significant challenge. Existing approaches often fail to find optimal solutions or cannot efficiently handle such tasks. We propose HYPRL, a specification-guided reinforcement learning framework that learns control policies with respect to hyperproperties expressed in HyperLTL. Hyperproperties constitute a powerful formalism for specifying objectives and constraints over sets of execution traces across agents. To learn policies that maximize the satisfaction of a HyperLTL formula $\phi$, we apply Skolemization to manage quantifier alternations and define quantitative robustness functions to shape rewards over execution traces of a Markov decision process with unknown transitions. A suitable RL algorithm is then used to learn policies that collectively maximize the expected reward and, consequently, increase the probability of satisfying $\phi$. We evaluate HYPRL on a diverse set of benchmarks, including safety-aware planning, Deep Sea Treasure, and the Post Correspondence Problem. We also compare with specification-driven baselines to demonstrate the effectiveness and efficiency of HYPRL.
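To illustrate the idea of shaping rewards with quantitative robustness over traces, here is a minimal Python sketch. This is not the paper's implementation: the function names (`rob_always`, `rob_eventually`, `shaped_reward`) and the signed-margin predicates are illustrative assumptions, and only the standard max/min robustness semantics for the temporal operators G ("always") and F ("eventually") over finite traces are used.

```python
# Minimal sketch (assumed names, not HYPRL's actual API) of quantitative
# robustness for reward shaping over a finite execution trace.
# A predicate maps a state to a signed margin: positive iff satisfied.

def rob_always(trace, pred):
    # G p: worst-case margin over the trace (min over all steps)
    return min(pred(s) for s in trace)

def rob_eventually(trace, pred):
    # F p: best-case margin over the trace (max over all steps)
    return max(pred(s) for s in trace)

def shaped_reward(prefix, pred):
    """Per-step shaping signal: the change in 'eventually' robustness
    as the trace grows by one state. The per-episode return then
    telescopes to the robustness of the full trace."""
    if len(prefix) < 2:
        return rob_eventually(prefix, pred)
    return rob_eventually(prefix, pred) - rob_eventually(prefix[:-1], pred)

# Example: states are distances to a goal; the (assumed) predicate
# "within distance 1 of the goal" has margin 1.0 - distance.
trace = [3.0, 2.0, 0.5]
within_goal = lambda s: 1.0 - s

final_rob = rob_eventually(trace, within_goal)          # 0.5
rewards = [shaped_reward(trace[:t + 1], within_goal)    # telescoping sum
           for t in range(len(trace))]
assert abs(sum(rewards) - final_rob) < 1e-9
```

Because the shaped rewards telescope, maximizing the expected return is aligned with maximizing the final trace robustness, which is the intuition behind robustness-based reward shaping; handling HyperLTL quantifier alternations via Skolemization, as the abstract describes, operates on top of such per-trace robustness values.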