Although deep reinforcement learning (DRL) has many success stories, the large-scale deployment of policies learned through these advanced techniques in safety-critical scenarios is hindered by their lack of formal guarantees. Variational Markov Decision Processes (VAE-MDPs) are discrete latent space models that provide a reliable framework for distilling formally verifiable controllers from any RL policy. While the related guarantees address relevant practical aspects such as the satisfaction of performance and safety properties, the VAE approach suffers from several learning flaws (posterior collapse, slow learning speed, poor dynamics estimates), primarily due to the absence of abstraction and representation guarantees to support latent optimization. We introduce the Wasserstein auto-encoded MDP (WAE-MDP), a latent space model that fixes those issues by minimizing a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy, for which the formal guarantees apply. Our approach yields bisimulation guarantees while learning the distilled policy, allowing concrete optimization of the abstraction and representation model quality. Our experiments show that, besides distilling policies up to 10 times faster, the latent model quality is indeed better in general. Moreover, we present experiments from a simple time-to-failure verification algorithm on the latent space. The fact that our approach enables such simple verification techniques highlights its applicability.
翻译:尽管深度强化学习(DRL)已取得诸多成功案例,但在安全关键场景中大规模部署通过先进技术习得的策略仍因其缺乏形式化保证而受阻。变分马尔可夫决策过程(VAE-MDP)作为离散潜空间模型,为从任意强化学习策略中蒸馏可形式化验证控制器提供了可靠框架。尽管其相关保证解决了满足性能与安全属性等实际关键问题,但VAE方法仍存在若干学习缺陷(后验坍塌、收敛缓慢、动态估计能力弱),这主要源于缺乏支撑潜空间优化的抽象与表征保证。我们提出Wasserstein自编码马尔可夫决策过程(WAE-MDP),该潜空间模型通过最小化执行原始策略与经形式化保证的蒸馏策略的智能体行为之间的最优传输惩罚项,解决了上述问题。本方法在蒸馏策略过程中同时获得双模拟保证,实现了抽象与表征模型质量的具体优化。实验表明,除蒸馏速度提升达10倍外,潜空间模型质量整体更优。我们还展示了基于潜空间的简单失效时间验证算法的实验结果。本方法能够支撑此类简单验证技术的事实,凸显了其实际应用价值。