Linear temporal logic (LTL) offers a simplified way of specifying tasks for policy optimization that may otherwise be difficult to describe with scalar reward functions. However, the standard RL framework can be too myopic to find maximally LTL-satisfying policies. This paper makes two contributions. First, we develop a new value-function-based proxy, using a technique we call eventual discounting, under which one can find policies that satisfy the LTL specification with the highest achievable probability. Second, we develop a new experience replay method for generating off-policy data from on-policy rollouts via counterfactual reasoning on different ways of satisfying the LTL specification. Our experiments, conducted in both discrete and continuous state-action spaces, confirm the effectiveness of our counterfactual experience replay approach.