Reinforcement learning can provide effective reasoning for sequential decision-making problems with variable dynamics. In practical implementations, however, interpreting the reward function and the corresponding optimal policy remains a persistent challenge. Representing sequential decision-making problems as probabilistic inference is therefore valuable: in principle, inference offers diverse and powerful mathematical tools for inferring the stochastic dynamics while also suggesting a probabilistic interpretation of policy optimization. In this study, we propose a novel method, Adaptive Wasserstein Variational Optimization (AWaVO), to tackle these interpretability challenges. Our approach uses formal methods to provide interpretability in three respects: convergence guarantees, training transparency, and intrinsic decision interpretation. To demonstrate its practicality, we showcase guaranteed interpretability with an optimal global convergence rate in simulation and in practical quadrotor tasks. In comparison with state-of-the-art benchmarks including TRPO-IPO, PCPO, and CRPO, we empirically verify that AWaVO offers a reasonable trade-off between high performance and sufficient interpretability.