Reinforcement learning (RL) tackles sequential decision-making problems by creating agents that interact with their environment. However, existing algorithms often treat these problems as static, focusing on point estimates of model parameters to maximize expected rewards while neglecting the stochastic dynamics of agent-environment interactions and the critical role of uncertainty quantification. Our research leverages the Kalman filtering paradigm to introduce a novel and scalable sampling algorithm called Langevinized Kalman Temporal-Difference (LKTD) for deep reinforcement learning. This algorithm, grounded in Stochastic Gradient Markov Chain Monte Carlo (SGMCMC), efficiently draws samples from the posterior distribution of deep neural network parameters. Under mild conditions, we prove that the posterior samples generated by the LKTD algorithm converge to a stationary distribution. This convergence not only enables us to quantify uncertainties associated with the value function and model parameters but also allows us to monitor these uncertainties during policy updates throughout the training phase. The LKTD algorithm paves the way for more robust and adaptable reinforcement learning approaches.
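To make the SGMCMC idea concrete, the following is a minimal sketch of stochastic gradient Langevin dynamics (SGLD), one of the simplest SGMCMC samplers. It is not the LKTD algorithm: the Kalman-filtering and temporal-difference components are omitted, and a linear model with Gaussian prior and noise stands in for a value-function approximator. The function names, the toy data, and all hyperparameters (`step_size`, `prior_var`, `noise_var`, `batch_size`) are assumptions chosen for illustration. The sketch only shows how minibatch gradients plus injected Gaussian noise yield approximate posterior samples, and how those samples give an uncertainty estimate for a predicted value.

```python
import numpy as np


def sgld_linear_regression(X, y, n_iters=5000, batch_size=32, step_size=1e-3,
                           prior_var=10.0, noise_var=1.0, burn_in=1000, seed=0):
    """Draw approximate posterior samples of linear weights with SGLD.

    Assumed model (illustrative only): y = X @ theta + N(0, noise_var),
    with prior theta ~ N(0, prior_var * I).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    samples = []
    for t in range(n_iters):
        # Minibatch of observations, drawn without replacement.
        idx = rng.choice(n, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # Gradient of the log prior.
        grad_prior = -theta / prior_var
        # Minibatch gradient of the log likelihood, rescaled to the full data set.
        grad_lik = (n / batch_size) * (Xb.T @ (yb - Xb @ theta)) / noise_var
        # Langevin step: half-step along the gradient plus N(0, step_size) noise.
        noise = rng.normal(size=d) * np.sqrt(step_size)
        theta = theta + 0.5 * step_size * (grad_prior + grad_lik) + noise
        if t >= burn_in:
            samples.append(theta.copy())
    return np.array(samples)


def exact_posterior(X, y, prior_var=10.0, noise_var=1.0):
    """Closed-form Gaussian posterior for the same model, used as a reference."""
    d = X.shape[1]
    cov = np.linalg.inv(X.T @ X / noise_var + np.eye(d) / prior_var)
    mean = cov @ X.T @ y / noise_var
    return mean, cov


if __name__ == "__main__":
    # Toy data with hypothetical "true" weights.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))
    y = X @ np.array([1.5, -0.5]) + rng.normal(size=200)

    samples = sgld_linear_regression(X, y)
    mean, _ = exact_posterior(X, y)
    print("SGLD sample mean:    ", samples.mean(axis=0))
    print("Exact posterior mean:", mean)

    # Uncertainty of the predicted value at one query feature vector.
    x_query = np.array([1.0, 1.0])
    values = samples @ x_query
    print("Predicted value mean and std:", values.mean(), values.std())
```

With a fixed step size and minibatch gradients, the samples carry some discretization bias and extra variance, so the spread they report is only an approximation of the exact posterior spread. The closed-form posterior is included here solely because the toy model admits one; for deep networks no such reference exists, which is the setting the abstract addresses.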