Reinforcement learning or optimal control can provide effective reasoning for sequential decision-making problems with variable dynamics. In practice, however, interpreting the reward function and the corresponding optimal policy remains a persistent challenge. Formalizing sequential decision-making as inference is therefore of considerable value: probabilistic inference in principle offers diverse and powerful mathematical tools for inferring the stochastic dynamics, while also suggesting a probabilistic interpretation of reward design and policy convergence. In this study, we propose a novel approach, Adaptive Wasserstein Variational Optimization (AWaVO), to tackle these challenges in sequential decision-making. Our approach uses formal methods to provide interpretations of reward design, transparency of training convergence, and a probabilistic interpretation of sequential decisions. To demonstrate practicality, we show convergent training with guaranteed global convergence rates both in simulation and in real robot tasks, and we empirically verify a reasonable tradeoff between high performance and conservative interpretability.