A computational problem in biological reward-based learning is how credit assignment is performed in the nucleus accumbens (NAc). Much research suggests that NAc dopamine encodes temporal-difference (TD) errors for learning value predictions. However, dopamine is synchronously distributed in regionally homogeneous concentrations, which does not support explicit credit assignment (as used by backpropagation). It is unclear whether distributed errors alone are sufficient for synapses to make coordinated updates and learn complex, nonlinear reward-based tasks. We design a new deep Q-learning algorithm, Artificial Dopamine, to computationally demonstrate that synchronously distributed, per-layer TD errors may be sufficient to learn surprisingly complex RL tasks. We empirically evaluate our algorithm on MinAtar, the DeepMind Control Suite, and classic control tasks, and show that it often achieves performance comparable to deep RL algorithms that use backpropagation.
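To make the per-layer idea concrete, the following is a minimal sketch, not the authors' implementation, of a Q-network in which every hidden layer carries its own Q-value head and is trained only by a local TD error, with gradients blocked between layers so no backpropagation crosses layer boundaries. The class name `LocalQLayer`, the choice of a per-layer bootstrap target, and all sizes are illustrative assumptions rather than details from the paper; a replay buffer and target network, standard in deep Q-learning, are omitted for brevity.

```python
# Hedged sketch: per-layer TD learning without inter-layer backpropagation.
# Assumption (not from the source): each layer bootstraps from its own head.
import torch
import torch.nn as nn

class LocalQLayer(nn.Module):
    """One hidden layer plus a local Q head, trained only by its own TD error."""
    def __init__(self, in_dim: int, hidden_dim: int, n_actions: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, x):
        h = self.body(x.detach())  # detach: no gradient flows to earlier layers
        return h, self.q_head(h)

def per_layer_td_losses(layers, obs, action, reward, next_obs, done, gamma=0.99):
    """Compute one local TD loss per layer; every layer sees the same
    (reward, done) signal, mimicking a synchronously distributed error."""
    x, x_next, losses = obs, next_obs, []
    for layer in layers:
        x, q = layer(x)
        with torch.no_grad():  # bootstrap target carries no gradient
            x_next, q_next = layer(x_next)
            target = reward + gamma * (1 - done) * q_next.max(dim=-1).values
        td_error = target - q.gather(-1, action.unsqueeze(-1)).squeeze(-1)
        losses.append(td_error.pow(2).mean())
    return losses  # summing and calling .backward() still keeps updates local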