We propose a model-free reinforcement learning solution, namely the ASAP-Phi framework, to encourage an agent to fulfill a formal specification ASAP. The framework leverages a piece-wise reward function that assigns quantitative semantic reward to traces not satisfying the specification, and a high constant reward to the remaining. Then, it trains an agent with an actor-critic-based algorithm, such as soft actor-critic (SAC), or deep deterministic policy gradient (DDPG). Moreover, we prove that ASAP-Phi produces policies that prioritize fulfilling a specification ASAP. Extensive experiments are run, including ablation studies, on state-of-the-art benchmarks. Results show that our framework succeeds in finding sufficiently fast trajectories for up to 97\% test cases and defeats baselines.
翻译:我们提出了一种无模型强化学习解决方案,即ASAP-Phi框架,旨在激励智能体尽可能快地满足形式化规范。该框架采用分段奖励函数:对未满足规范的轨迹赋予定量语义奖励,而对剩余轨迹赋予高常数奖励。随后,我们使用基于演员-评论家(Actor-Critic)的算法(如软演员-评论家SAC或深度确定性策略梯度DDPG)对智能体进行训练。此外,我们证明了ASAP-Phi能产生优先满足规范且追求即时性的策略。在最新基准测试上开展了大量实验(包括消融研究),结果表明,我们的框架能在97%的测试用例中成功找到足够快速的轨迹,并优于基线方法。