Reinforcement learning (RL) has made substantial progress on sequential decision-making problems such as robotic control tasks. Because policies tend to overfit to their training environments, however, RL methods often fail to generalize to safety-critical test scenarios. Robust adversarial RL (RARL) was previously proposed to address this by training an adversarial network that applies disturbances to the system, which improves robustness in test scenarios. A drawback of neural network-based adversaries is that integrating system requirements is difficult without handcrafting sophisticated reward signals. Safety falsification methods, in contrast, search for a set of initial conditions and an input sequence such that the system violates a given property formulated in temporal logic. In this paper, we propose falsification-based RARL (FRARL), the first generic framework for integrating temporal logic falsification into adversarial learning to improve policy robustness. Because the adversary is driven by our falsification method, no extra reward function needs to be constructed for it. We evaluate our approach on a braking assistance system and an adaptive cruise control system for autonomous vehicles. Our experimental results demonstrate that policies trained with a falsification-based adversary generalize better and violate the safety specification less often in test scenarios than policies trained without an adversary or with an adversarial network.
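To make the notion of falsification concrete, the following is a minimal sketch and not the FRARL implementation. It assumes a toy one-dimensional car-following model, the safety property "always gap > 0", and plain random search in place of a dedicated falsification tool. All names (`simulate`, `robustness`, `falsify`, `lazy_policy`) and all numeric ranges are hypothetical choices made for illustration.

```python
import random

DT = 0.1  # simulation step in seconds (assumed value)


def simulate(gap0, speed0, lead_accels, policy):
    """Roll out a toy car-following model and return the gap at every step."""
    gap, ego_v, lead_v = gap0, speed0, speed0
    gaps = [gap]
    for a_lead in lead_accels:
        a_ego = policy(gap, ego_v)               # action of the policy under test
        ego_v = max(0.0, ego_v + a_ego * DT)     # vehicles cannot drive backwards
        lead_v = max(0.0, lead_v + a_lead * DT)  # disturbance: lead acceleration
        gap += (lead_v - ego_v) * DT
        gaps.append(gap)
    return gaps


def robustness(gaps, min_gap=0.0):
    """Worst-case margin of 'always gap > min_gap'; negative means violated."""
    return min(g - min_gap for g in gaps)


def falsify(policy, n_trials=200, horizon=50, seed=0):
    """Random-search falsifier: sample initial conditions and an input
    sequence, keep the sample with the lowest robustness, stop on violation."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        gap0 = rng.uniform(5.0, 30.0)    # initial gap in metres (assumed range)
        speed0 = rng.uniform(5.0, 20.0)  # initial speed in m/s (assumed range)
        lead_accels = [rng.uniform(-6.0, 1.0) for _ in range(horizon)]
        rho = robustness(simulate(gap0, speed0, lead_accels, policy))
        if best is None or rho < best[0]:
            best = (rho, gap0, speed0, lead_accels)
        if rho < 0.0:  # counterexample found
            break
    return best


def lazy_policy(gap, ego_v):
    """Deliberately unsafe stand-in for a learned policy: never brakes."""
    return 0.0
```

Calling `falsify(lazy_policy)` returns a tuple of the lowest robustness found together with the initial gap, initial speed, and lead-acceleration sequence that produced it. In an adversarial training loop, such counterexamples would serve as the disturbances the policy is trained against, so the temporal logic property itself guides the adversary and no separate adversary reward is required.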