Reward specification plays a central role in reinforcement learning (RL), guiding the agent's behavior. To express non-Markovian rewards, formalisms such as reward machines have been introduced to capture dependencies on histories. However, traditional reward machines cannot model precise timing constraints, which limits their use in time-sensitive applications. In this paper, we propose timed reward machines (TRMs), an extension of reward machines that incorporates timing constraints into the reward structure. TRMs enable more expressive specifications with tunable reward logic, for example, imposing costs for delays and granting rewards for timely actions. We study model-free RL frameworks (i.e., tabular Q-learning) for learning optimal policies with TRMs under digital and real-time semantics. Our algorithms integrate the TRM into learning via abstractions of timed automata, and employ counterfactual-imagining heuristics that exploit the structure of the TRM to improve the search. Experimentally, we demonstrate on popular RL benchmarks that our algorithms learn policies that achieve high rewards while satisfying the timing constraints specified by the TRM. Moreover, we conduct comparative studies of performance under different TRM semantics, along with ablations that highlight the benefits of counterfactual imagining.
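To make the idea concrete, the following is a minimal illustrative sketch, not the construction used in the paper. It assumes a single integer clock that advances by one tick per environment step (a simplified stand-in for digital semantics), caps the clock at a hypothetical `max_clock` (values beyond the largest guard constant are treated as equivalent), and runs tabular Q-learning on the product of environment state, TRM state, and clock value. The names `Edge`, `TimedRewardMachine`, `q_update`, and `counterfactual_updates` are all hypothetical; the last one replays a single observed environment transition from every TRM configuration, which is one plausible reading of a counterfactual heuristic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical TRM edge: fires on `label` when lo <= clock <= hi."""
    label: str
    lo: int
    hi: int
    reward: float
    dst: str
    reset: bool = False


class TimedRewardMachine:
    """Single-clock TRM with integer time (an assumption of this sketch)."""

    def __init__(self, initial, edges, max_clock):
        self.initial = initial
        self.edges = edges          # dict: TRM state -> list of Edge
        self.max_clock = max_clock  # clock values are capped at this bound

    def configurations(self):
        """All (TRM state, clock value) pairs of the capped machine."""
        states = set(self.edges)
        states |= {e.dst for es in self.edges.values() for e in es}
        return [(u, c) for u in sorted(states)
                for c in range(self.max_clock + 1)]

    def step(self, u, clock, label):
        """Let one tick elapse, then take the first enabled edge, if any."""
        clock = min(clock + 1, self.max_clock)
        for e in self.edges.get(u, []):
            if e.label == label and e.lo <= clock <= e.hi:
                return e.dst, (0 if e.reset else clock), e.reward
        return u, clock, 0.0        # no edge enabled: stay put, zero reward


def q_update(Q, trm, s, a, s_next, label, actions, u, clock,
             alpha=0.1, gamma=0.9):
    """One tabular Q-learning update on the product state (s, u, clock)."""
    u2, c2, r = trm.step(u, clock, label)
    best_next = max(Q[(s_next, u2, c2, b)] for b in actions)
    key = (s, u, clock, a)
    Q[key] += alpha * (r + gamma * best_next - Q[key])
    return u2, c2, r


def counterfactual_updates(Q, trm, s, a, s_next, label, actions):
    """Replay one environment transition from every TRM configuration."""
    for u, c in trm.configurations():
        q_update(Q, trm, s, a, s_next, label, actions, u, c)


# Example TRM: reaching "goal" within 3 ticks earns +1.0, later costs 0.5.
trm = TimedRewardMachine(
    initial="u0",
    edges={"u0": [Edge("goal", 0, 3, 1.0, "done", reset=True),
                  Edge("goal", 4, 5, -0.5, "done", reset=True)]},
    max_clock=5,
)
Q = defaultdict(float)
actions = ["go", "stay"]
q_update(Q, trm, "s0", "go", "s1", "goal", actions, "u0", 1)
counterfactual_updates(Q, trm, "s0", "go", "s1", "goal", actions)
```

Keeping the clock value in the Q-table key is what lets the learned policy distinguish a timely arrival from a late one. The cap keeps the table finite, at the cost of merging all clock values beyond the bound.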