Recent advances in reinforcement learning (RL) have renewed interest in reward design for shaping agent behavior, but manually crafting reward functions is tedious and error-prone. A principled alternative is to specify behavioral requirements in a formal, unambiguous language and automatically compile them into learning objectives. $\omega$-regular languages are a natural fit, given their role in formal verification and synthesis. However, most existing $\omega$-regular RL approaches operate in an episodic, discounted setting with periodic resets, which is misaligned with $\omega$-regular semantics over infinite traces. For continuing tasks, where the agent interacts with the environment over a single uninterrupted lifetime, the average-reward criterion is more appropriate. We focus on absolute liveness specifications, a subclass of $\omega$-regular languages that cannot be violated by any finite prefix and thus aligns naturally with continuing interaction. We present the first model-free RL framework that translates absolute liveness specifications into average-reward objectives and enables learning in unknown communicating Markov decision processes (MDPs) without episodic resetting. We also introduce a reward structure for lexicographic multi-objective optimization: among policies that maximize the satisfaction probability of an absolute liveness specification, the agent maximizes an external average-reward objective. Our method guarantees convergence in unknown communicating MDPs and supports on-the-fly reductions that do not require full environment knowledge, enabling model-free learning. Experiments across several benchmarks show that the continuing, average-reward approach outperforms competing discount-based methods.
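To make the two key notions precise, here is a minimal formal sketch; the notation ($\Sigma$, $\Pr^\pi$, $r_t$) is ours and not taken from the abstract. A language $L \subseteq \Sigma^\omega$ is an absolute liveness property (in the sense of Sistla) if $L \neq \emptyset$ and $L$ is closed under prepending arbitrary finite prefixes:
\[
w \in L \;\implies\; uw \in L \quad \text{for all } u \in \Sigma^*.
\]
Hence no finite prefix can falsify the specification: any finite behavior $u$ extends to a satisfying trace $uw$ for any $w \in L$. The lexicographic objective then restricts attention to the set $\Pi^* = \arg\max_{\pi} \Pr^\pi[L]$ of policies maximizing the satisfaction probability, and among these maximizes the external long-run average reward:
\[
\max_{\pi \in \Pi^*} \; \liminf_{T \to \infty} \frac{1}{T} \, \mathbb{E}^\pi\!\left[ \sum_{t=0}^{T-1} r_t \right].
\]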