Model-based reinforcement learning refers to a family of sample-efficient decision-making approaches that build an explicit model of the environment, which can then be used to learn optimal policies. In this paper, we propose a temporal Gaussian mixture model composed of a perception model and a transition model. The perception model extracts discrete (latent) states from continuous observations using a variational Gaussian mixture likelihood. Importantly, the model continuously monitors the collected data in search of new Gaussian components, i.e., the perception model performs a form of structure learning (Smith et al., 2020; Friston et al., 2018; Neacsu et al., 2022) as it learns the number of Gaussian components in the mixture. Additionally, the transition model learns the temporal transitions between consecutive time steps by taking advantage of Dirichlet-categorical conjugacy. Both the perception and transition models can forget part of the data points while integrating the information they provide into the prior, which ensures fast variational inference. Finally, decision making is performed with a variant of Q-learning that learns Q-values from beliefs over states. Empirically, we demonstrate the model's ability to learn the structure of several mazes: the model discovered the number of states and the transition probabilities between them. Moreover, using its learned Q-values, the agent successfully navigated from the starting position to the maze's exit.
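The structure-learning idea — spawning a new Gaussian component when no existing component explains an incoming observation — can be sketched as below. The novelty test, the `log_threshold` value, and the isotropic unit variance are illustrative assumptions for this sketch, not the paper's variational criterion.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log-density of an isotropic Gaussian with covariance var * I."""
    d = x.shape[0]
    return -0.5 * (d * np.log(2 * np.pi * var) + np.sum((x - mean) ** 2) / var)

def maybe_add_component(means, x, var=1.0, log_threshold=-10.0):
    """Add a new component centered at x when every existing component
    assigns x a log-likelihood below the (assumed) novelty threshold."""
    if not means or max(gaussian_logpdf(x, m, var) for m in means) < log_threshold:
        means.append(x.copy())
    return means
```

In the full model, a new component would trigger a variational update of the mixture posterior rather than a hard threshold test; the sketch only conveys the monitoring loop.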
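The Dirichlet-categorical conjugacy used by the transition model admits a closed-form posterior update: observed state transitions simply accumulate as pseudo-counts on top of the Dirichlet prior. A minimal sketch, assuming a symmetric Dirichlet(1) prior and using the posterior mean as the point estimate:

```python
import numpy as np

def update_transition_counts(counts, transitions):
    """Accumulate Dirichlet pseudo-counts from observed (state, next_state) pairs."""
    for s, s_next in transitions:
        counts[s, s_next] += 1
    return counts

def posterior_transition_matrix(counts):
    """Posterior mean of the transition probabilities under the
    Dirichlet-categorical conjugate pair (row-wise normalization)."""
    return counts / counts.sum(axis=1, keepdims=True)
```

Because the update is just counting, old data points can be discarded once their counts are absorbed into the prior, which is what makes the inference fast.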
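One way a Q-learning variant can operate on beliefs rather than known states is to weight the temporal-difference update by the belief over discrete states. The update rule below is a plausible sketch of this idea, not necessarily the paper's exact rule; `Q` is a (states x actions) table and `b`, `b_next` are probability vectors over states.

```python
import numpy as np

def belief_q_update(Q, b, a, r, b_next, alpha=0.1, gamma=0.95):
    """One TD update where the state is only known through beliefs.

    Q:       (n_states, n_actions) table of Q-values
    b, b_next: belief vectors over discrete states before and after acting
    a, r:    action taken and reward received
    """
    # Value of the greedy action under the next belief.
    v_next = np.max(b_next @ Q)
    # TD error computed on the belief-expected Q-value of (b, a).
    td = r + gamma * v_next - b @ Q[:, a]
    # Each state's Q-value is corrected in proportion to its belief mass.
    Q[:, a] += alpha * b * td
    return Q
```

With a one-hot belief this reduces exactly to tabular Q-learning, which is a useful sanity check for the rule.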