Recent advances in reinforcement learning (RL) have renewed interest in reward design for shaping agent behavior, but manually crafting reward functions is tedious and error-prone. A principled alternative is to specify behavioral requirements in a formal, unambiguous language and automatically compile them into learning objectives. $\omega$-regular languages are a natural fit, given their role in formal verification and synthesis. However, most existing $\omega$-regular RL approaches operate in an episodic, discounted setting with periodic resets, which is misaligned with $\omega$-regular semantics over infinite traces. For continuing tasks, where the agent interacts with the environment over a single uninterrupted lifetime, the average-reward criterion is more appropriate. We focus on absolute liveness specifications, a subclass of $\omega$-regular languages that cannot be violated by any finite prefix and thus aligns naturally with continuing interaction. We present the first model-free RL framework that translates absolute liveness specifications into average-reward objectives and enables learning in unknown communicating Markov decision processes (MDPs) without episodic resetting. We also introduce a reward structure for lexicographic multi-objective optimization: among policies that maximize the satisfaction probability of an absolute liveness specification, the agent maximizes an external average-reward objective. Our method guarantees convergence in unknown communicating MDPs and supports on-the-fly reductions that do not require full environment knowledge, enabling model-free learning. Experiments across several benchmarks show that the continuing, average-reward approach outperforms competing discount-based methods.
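To make the two key notions precise, here is a minimal formal sketch; the notation ($\Sigma$, $\Pr^\pi$, $r_t$) is ours and not taken from the abstract. A language $L \subseteq \Sigma^\omega$ is an absolute liveness property (in the sense of Sistla) if $L \neq \emptyset$ and $L$ is closed under prepending arbitrary finite prefixes:
\[
w \in L \;\implies\; uw \in L \quad \text{for all } u \in \Sigma^*.
\]
Hence no finite prefix can falsify the specification: any finite behavior $u$ extends to a satisfying trace $uw$ for any $w \in L$. The lexicographic objective then restricts attention to the set $\Pi^* = \arg\max_{\pi} \Pr^\pi[L]$ of policies maximizing the satisfaction probability, and among these maximizes the external long-run average reward:
\[
\max_{\pi \in \Pi^*} \; \liminf_{T \to \infty} \frac{1}{T} \, \mathbb{E}^\pi\!\left[ \sum_{t=0}^{T-1} r_t \right].
\]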