Reinforcement learning or optimal control can provide effective reasoning for sequential decision-making problems with variable dynamics. In practical implementations, however, interpreting the reward function and the corresponding optimal policy remains a persistent challenge. Formalizing sequential decision-making problems as inference is therefore of considerable value: probabilistic inference in principle offers diverse and powerful mathematical tools for inferring the stochastic dynamics, while also suggesting a probabilistic interpretation of reward design and policy convergence. In this study, we propose Adaptive Wasserstein Variational Optimization (AWaVO), a novel approach to these challenges in sequential decision-making. Our approach uses formal methods to provide an interpretation of reward design, transparency of training convergence, and a probabilistic interpretation of sequential decisions. To demonstrate practicality, we show convergent training with guaranteed global convergence rates both in simulation and on real robot tasks, and we empirically verify a reasonable tradeoff between high performance and conservative interpretability.