In this work, we propose a novel framework for the logical specification of non-Markovian rewards in Markov Decision Processes (MDPs) with large state spaces. Our approach leverages Linear Temporal Logic Modulo Theories over finite traces (LTLfMT), a more expressive extension of classical temporal logic in which predicates are first-order formulas of arbitrary first-order theories rather than simple Boolean variables. This enhanced expressiveness enables the specification of complex tasks over unstructured and heterogeneous data domains, promoting a unified and reusable framework that eliminates the need for manual predicate encoding. However, the increased expressive power of LTLfMT introduces additional theoretical and computational challenges compared to standard LTLf specifications. We address these challenges from a theoretical standpoint, identifying a fragment of LTLfMT that is tractable yet sufficiently expressive for reward specification in an infinite-state-space context. From a practical perspective, we introduce a method based on reward machines and Hindsight Experience Replay (HER) to translate first-order logic specifications into rewards and to address reward sparsity. We evaluate this approach in a continuous-control setting using the theory of non-linear arithmetic, showing that it enables natural specification of complex tasks. Experimental results show that a tailored implementation of HER is essential for solving tasks with complex goals.
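To make the central mechanism concrete, the following is a minimal illustrative sketch, not the paper's implementation: a reward machine whose transition guards are first-order predicates over continuous observations (here, non-linear arithmetic constraints) instead of Boolean propositions. All class and variable names are hypothetical.

```python
# Hypothetical sketch: a reward machine over first-order predicates.
# Guards are arbitrary predicates on a real-valued observation, e.g.
# non-linear arithmetic constraints, rather than Boolean variables.

from typing import Callable, Dict, List, Tuple

Obs = Dict[str, float]          # a continuous observation
Guard = Callable[[Obs], bool]   # a first-order predicate over the observation


class RewardMachine:
    """Finite-state machine: a transition fires when its predicate holds
    on the current observation; rewards label the transitions."""

    def __init__(self, start: str, final: str):
        self.state = start
        self.final = final
        # transitions[state] = list of (guard, next_state, reward)
        self.transitions: Dict[str, List[Tuple[Guard, str, float]]] = {}

    def add(self, src: str, guard: Guard, dst: str, reward: float) -> None:
        self.transitions.setdefault(src, []).append((guard, dst, reward))

    def step(self, obs: Obs) -> float:
        """Advance the machine on one observation; return the reward."""
        for guard, dst, reward in self.transitions.get(self.state, []):
            if guard(obs):
                self.state = dst
                return reward
        return 0.0  # no guard satisfied: stay put (sparse zero reward)


# Example task: first reach the unit disc, then leave the strip |y| < 0.5.
rm = RewardMachine(start="u0", final="u2")
rm.add("u0", lambda o: o["x"] ** 2 + o["y"] ** 2 <= 1.0, "u1", 0.0)
rm.add("u1", lambda o: abs(o["y"]) >= 0.5, "u2", 1.0)
```

Because reward is emitted only when the final transition fires, such specifications are naturally sparse, which is the motivation for pairing the machine with a HER-style relabeling scheme during training.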