Reinforcement learning can provide effective reasoning for sequential decision-making problems with variable dynamics. In practice, however, interpreting the reward function and the corresponding optimal policy remains a persistent challenge. Representing sequential decision-making problems as probabilistic inference is therefore valuable: in principle, inference offers diverse and powerful mathematical tools for inferring the stochastic dynamics while also suggesting a probabilistic interpretation of policy optimization. In this study, we propose Adaptive Wasserstein Variational Optimization (AWaVO), a novel approach to tackling these interpretability challenges. Our approach uses formal methods to achieve interpretability in three respects: convergence guarantees, training transparency, and intrinsic decision interpretation. To demonstrate its practicality, we showcase guaranteed interpretability with an optimal global convergence rate in simulation and in practical quadrotor tasks. In comparison with state-of-the-art benchmarks including TRPO-IPO, PCPO and CRPO, we empirically verify that AWaVO offers a reasonable trade-off between high performance and sufficient interpretability.