Specifying informative and dense reward functions remains a pivotal challenge in Reinforcement Learning, as it directly affects the efficiency of agent training. In this work, we harness the expressive power of quantitative Linear Temporal Logic on finite traces ($\text{LTL}_f[\mathcal{F}]$) to synthesize reward monitors that generate a dense stream of rewards for runtime-observable state trajectories. By providing nuanced feedback during training, these monitors guide agents toward optimal behaviour and help mitigate the well-known sparse-reward problem in long-horizon decision making, which arises under the Boolean semantics that dominates the current literature. Our framework is algorithm-agnostic, relies only on a state labelling function, and naturally accommodates the specification of non-Markovian properties. Empirical results show that our quantitative monitors consistently subsume Boolean monitors and, depending on the environment, outperform them in maximizing a quantitative measure of task completion and in reducing convergence time.
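To make the reward-shaping idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of how a quantitative monitor and a state labelling function could be wired into training as an environment wrapper. `QuantitativeMonitor`, `MonitorRewardWrapper`, and `label` are assumed names; the toy monitor below only counts required propositions and stands in for a real $\text{LTL}_f[\mathcal{F}]$ evaluation over the observed finite trace.

```python
# Hypothetical sketch: dense rewards from a quantitative monitor over labelled traces.
import gymnasium as gym


class QuantitativeMonitor:
    """Toy monitor: fraction of required propositions observed so far.
    Placeholder for a quantitative LTL_f[F] evaluation of the finite trace."""

    def __init__(self, required):
        self.required = set(required)
        self.seen = set()

    def reset(self):
        self.seen.clear()

    def value(self, labels):
        self.seen |= set(labels) & self.required
        return len(self.seen) / len(self.required)


class MonitorRewardWrapper(gym.Wrapper):
    """Replaces the environment reward with the change in monitor value,
    yielding a dense, progress-based signal for the temporal task."""

    def __init__(self, env, monitor, label):
        super().__init__(env)
        self.monitor = monitor
        self.label = label  # state labelling function: observation -> set of propositions
        self.prev = 0.0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.monitor.reset()
        self.prev = self.monitor.value(self.label(obs))
        return obs, info

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        value = self.monitor.value(self.label(obs))
        reward = value - self.prev  # reward the increase in task-completion measure
        self.prev = value
        return obs, reward, terminated, truncated, info
```

Because the wrapper only consumes labelled observations, any RL algorithm can train on it unchanged, which mirrors the algorithm-agnostic claim above.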