Ensembles of probabilistic dynamics models are widely used in model-based reinforcement learning because they outperform a single dynamics model in both asymptotic performance and sample efficiency. In this paper, we provide both practical and theoretical insights into the empirical success of the probabilistic dynamics model ensemble through the lens of Lipschitz continuity. We find that the stronger the Lipschitz condition on a value function, the smaller the gap between the Bellman operators induced by the true dynamics and by the learned dynamics, which allows the converged value function to be closer to the optimal value function. Hence, we hypothesize that the key function of the probabilistic dynamics model ensemble is to regularize the Lipschitz condition of the value function using generated samples. To test this hypothesis, we devise two practical robust training mechanisms that directly regularize the Lipschitz condition of the value function: computing adversarial noise, and regularizing the spectral norm of the value network. Empirical results show that, combined with our mechanisms, model-based RL algorithms with a single dynamics model outperform those with an ensemble of probabilistic dynamics models. These findings not only support the theoretical insight but also provide a practical solution for developing computationally efficient model-based RL algorithms.
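To make the spectral-norm mechanism concrete, the following is a minimal sketch and not the paper's implementation. It relies on the standard fact that, for a feed-forward network with 1-Lipschitz activations (such as ReLU), the product of the layers' spectral norms upper-bounds the network's Lipschitz constant. The function names, the power-iteration estimator, and the squared-norm penalty with coefficient `coeff` are all assumptions made for illustration; the abstract does not specify how the regularizer is formed.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix W (list of rows) by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    """Return the transpose of W as a list of rows."""
    return [list(col) for col in zip(*W)]


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(W, n_iters=200, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    rng = random.Random(seed)
    Wt = transpose(W)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
    for _ in range(n_iters):
        u = matvec(Wt, matvec(W, v))
        n = norm(u)
        if n == 0.0:  # W maps v to zero, e.g. W is the zero matrix
            return 0.0
        v = [x / n for x in u]
    return norm(matvec(W, v))


def lipschitz_upper_bound(weights):
    """Product of layer spectral norms.

    Upper-bounds the Lipschitz constant of an MLP whose activations
    are 1-Lipschitz (assumption: e.g. ReLU); biases do not affect it.
    """
    bound = 1.0
    for W in weights:
        bound *= spectral_norm(W)
    return bound


def spectral_penalty(weights, coeff=1e-2):
    """Hypothetical regularizer: coeff * sum of squared spectral norms.

    In training, a term like this would be added to the value loss.
    """
    return coeff * sum(spectral_norm(W) ** 2 for W in weights)


# Toy two-layer value network: R^2 -> R^2 -> R
W1 = [[3.0, 0.0], [0.0, 1.0]]
W2 = [[2.0, 0.0]]
bound = lipschitz_upper_bound([W1, W2])
penalty = spectral_penalty([W1, W2])
```

The bound is generally loose, so penalizing it controls the Lipschitz constant only indirectly. The adversarial-noise mechanism mentioned above instead targets the local behavior of the value function around sampled states.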