Although deep reinforcement learning (DRL) has many success stories, the large-scale deployment of policies learned through these advanced techniques in safety-critical scenarios is hindered by their lack of formal guarantees. Variational Markov Decision Processes (VAE-MDPs) are discrete latent space models that provide a reliable framework for distilling formally verifiable controllers from any RL policy. While the related guarantees address relevant practical aspects such as the satisfaction of performance and safety properties, the VAE approach suffers from several learning flaws (posterior collapse, slow learning speed, poor dynamics estimates), primarily due to the absence of abstraction and representation guarantees to support latent optimization. We introduce the Wasserstein auto-encoded MDP (WAE-MDP), a latent space model that fixes those issues by minimizing a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy, for which the formal guarantees apply. Our approach yields bisimulation guarantees while learning the distilled policy, allowing concrete optimization of the abstraction and representation model quality. Our experiments show that, besides distilling policies up to 10 times faster, the latent model quality is indeed better in general. Moreover, we present experiments from a simple time-to-failure verification algorithm on the latent space. The fact that our approach enables such simple verification techniques highlights its applicability.
翻译:尽管深度强化学习(DRL)取得了诸多成功,但在安全关键场景中大规模部署通过这类先进技术习得的策略,仍因其缺乏形式化保障而受阻。变分马尔可夫决策过程(VAE-MDP)作为一种离散潜空间模型,为从任意强化学习策略中提取可形式化验证的控制器提供了可靠框架。虽然其相关保障涵盖了性能与安全性满足等实际应用要素,但VAE方法因缺乏支持潜优化的抽象与表征保障,导致若干学习缺陷(后验坍塌、学习速度缓慢、动态模型估计较差)。本文提出Wasserstein自编码马尔可夫决策过程(WAE-MDP),通过最小化原始策略执行者与经形式化保障的精简策略之间行为的最优运输的惩罚形式,有效解决了上述问题。我们的方法在学习精简策略的同时提供了双模拟保障,能够具体优化抽象与表征模型的质量。实验表明,该方法不仅将策略精简速度提升高达10倍,且潜模型质量在总体上更为优异。此外,本文展示了基于潜空间的简单失效时间验证算法的实验结果。该方法对这类简单验证技术的兼容性,进一步凸显了其实际应用价值。