State-of-the-art deep reinforcement learning algorithms are sample inefficient, requiring a large number of episodes to reach asymptotic performance. Episodic Reinforcement Learning (ERL) algorithms, inspired by the mammalian hippocampus, typically use extended memory systems to bootstrap learning from past events and thereby overcome this sample-inefficiency problem. However, such memory augmentations are often used as mere buffers from which isolated past experiences are drawn for offline learning (e.g., replay). Here, we demonstrate that biasing the acquired memory content with the order of episodic sampling improves both the sample and memory efficiency of an episodic control algorithm. We test our Sequential Episodic Control (SEC) model in a foraging task and show that storing and using integrated episodes as event sequences leads to faster learning with lower memory requirements than a standard ERL benchmark, Model-Free Episodic Control, which buffers isolated events only. We also study the effect of memory constraints and forgetting on the sequential and non-sequential versions of the SEC algorithm. Furthermore, we discuss how a hippocampal-like fast memory system could bootstrap the slow cortical and subcortical learning that subserves habit formation in the mammalian brain.
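To make the contrast concrete, below is a minimal Python sketch of the distinction the abstract draws between buffering isolated events and storing ordered event sequences. The class name, matching rule, and parameters are illustrative assumptions, not the SEC implementation itself.

```python
from collections import deque

import numpy as np


class SequentialEpisodicMemory:
    """Hypothetical sketch of a sequence-based episodic memory.

    Unlike a flat buffer of isolated (state, action) -> value entries,
    each slot here holds an entire episode, preserving the order in
    which events were experienced.
    """

    def __init__(self, max_episodes=100):
        # Bounded memory: the oldest episodes are forgotten first.
        self.episodes = deque(maxlen=max_episodes)

    def store_episode(self, trajectory):
        """trajectory: ordered list of (state, action, reward) tuples."""
        self.episodes.append(trajectory)

    def retrieve(self, state, k=1):
        """Propose actions from stored sequences whose states match `state`.

        Matching the current state against positions *within* sequences
        lets the agent follow the remainder of a previously experienced
        episode, rather than evaluating isolated state-action values.
        """
        matches = []
        for trajectory in self.episodes:
            for t, (s, _a, _r) in enumerate(trajectory):
                dist = np.linalg.norm(np.asarray(s) - np.asarray(state))
                matches.append((dist, trajectory, t))
        matches.sort(key=lambda m: m[0])
        # For each of the k closest matches, return the action taken at
        # that point in the stored sequence.
        return [traj[t][1] for _, traj, t in matches[:k]]


# Minimal usage example with toy 2-D states.
memory = SequentialEpisodicMemory(max_episodes=10)
memory.store_episode([((0.0, 0.0), "north", 0.0), ((0.0, 1.0), "east", 1.0)])
print(memory.retrieve((0.1, 0.9)))  # -> ['east'], continuing the stored sequence
```

The `max_episodes` bound and nearest-neighbor matching are stand-ins for the memory constraints and forgetting mechanisms studied in the paper; a faithful implementation would follow the SEC algorithm's own storage and retrieval rules.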