Posterior Sampling for Reinforcement Learning (PSRL) is a well-known algorithm that augments model-based reinforcement learning (MBRL) algorithms with Thompson sampling. PSRL maintains posterior distributions over the environment transition dynamics and the reward function, which are intractable to maintain exactly in tasks with high-dimensional state and action spaces. Recent work shows that dropout, used in conjunction with neural networks, induces variational distributions that can approximate these posteriors. In this paper, we propose Event-based Variational Distributions for Exploration (EVaDE), variational distributions that are useful for MBRL, especially when the underlying domain is object-based. We leverage the general domain knowledge of object-based domains to design three types of event-based convolutional layers that direct exploration. These layers rely on Gaussian dropouts and are inserted between the layers of the deep neural network model to facilitate variational Thompson sampling. We empirically show the effectiveness of EVaDE-equipped Simulated Policy Learning (EVaDE-SimPLe) on the 100K Atari game suite.
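To make the dropout-as-variational-posterior mechanism concrete, the following is a minimal sketch, not the paper's implementation, of a convolutional layer with multiplicative Gaussian dropout in PyTorch. The class name `GaussianDropoutConv2d` and the noise-variance hyperparameter `alpha` are illustrative assumptions; the abstract does not specify the designs of the three event-based layers.

```python
import torch
import torch.nn as nn

class GaussianDropoutConv2d(nn.Module):
    """Convolution followed by multiplicative Gaussian noise.

    Multiplying activations elementwise by noise ~ N(1, alpha) induces
    an approximate variational posterior over the layer, in the spirit
    of dropout-based Bayesian approximations. This is a generic sketch,
    not one of EVaDE's event-based layer designs.
    """

    def __init__(self, in_ch, out_ch, kernel_size, alpha=0.1, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, **kwargs)
        self.alpha = alpha  # noise variance; an assumed hyperparameter

    def forward(self, x, sample=True):
        h = self.conv(x)
        if sample:
            # One fresh noise draw = one sample from the induced
            # variational distribution over the network, as used in
            # variational Thompson sampling.
            noise = 1.0 + (self.alpha ** 0.5) * torch.randn_like(h)
            h = h * noise
        return h
```

Under this reading, keeping `sample=True` and drawing new noise for each model rollout corresponds to sampling one model hypothesis per episode of interaction, which is the Thompson-sampling step that PSRL-style exploration requires.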