Advanced biological intelligence learns efficiently from an information-rich stream of stimuli, even when feedback on behaviour quality is sparse or absent. Such learning exploits implicit assumptions about task domains; we refer to it as Domain-Adapted Learning (DAL). In contrast, AI learning algorithms rely on explicit, externally provided measures of behaviour quality to acquire fit behaviour. This imposes an information bottleneck that precludes learning from diverse non-reward stimulus information, limiting learning efficiency. We consider how biological evolution circumvents this bottleneck to produce DAL. We propose that species first evolve the ability to learn from reward signals, which provides inefficient (bottlenecked) but broad adaptivity. From there, non-reward information can be integrated into the learning process through the gradual accumulation of biases that such information induces on specific task domains. This scenario provides a biologically plausible pathway towards bottleneck-free, domain-adapted learning. Focusing on the second phase of this scenario, we set up a population of neural networks (NNs) whose reward-driven learning is modelled as reinforcement learning (RL) with Advantage Actor-Critic (A2C), and allow evolution to improve learning efficiency by integrating non-reward information into the learning process through a neuromodulatory update mechanism. On a navigation task in continuous 2D space, evolved DAL agents show a 300-fold increase in learning speed compared to pure RL agents. Evolution is found to eliminate reliance on reward information altogether: DAL agents learn exclusively from non-reward information, using only local, neuromodulation-based connection weight updates.
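The abstract does not spell out the update rule, so the following is only a minimal sketch of what a local, neuromodulation-based weight update could look like. It assumes a modulated Hebbian form, in which each connection changes in proportion to its presynaptic input, its postsynaptic activity, and a per-unit modulation signal. The function names, the tanh layer, the learning rate, and the hand-set modulation vector are all hypothetical and are not taken from the paper; in the evolved agents the modulation would presumably be computed from non-reward stimulus information.

```python
import math
from typing import List

Matrix = List[List[float]]


def forward(weights: Matrix, inputs: List[float]) -> List[float]:
    """Return tanh activations of one dense layer without bias."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def neuromodulated_update(
    weights: Matrix,
    pre: List[float],
    post: List[float],
    modulation: List[float],
    lr: float = 0.1,
) -> Matrix:
    """Apply an assumed modulated Hebbian rule.

    dw[j][i] = lr * modulation[j] * post[j] * pre[i]

    Every term is available at the connection itself, so the rule is local:
    no reward signal and no backpropagated gradient is used.
    """
    return [
        [w + lr * m * y * x for w, x in zip(row, pre)]
        for row, y, m in zip(weights, post, modulation)
    ]


# Hypothetical two-input, two-unit layer.
weights = [[0.5, -0.2], [0.1, 0.3]]
stimulus = [1.0, 0.5]
activity = forward(weights, stimulus)

# Hand-set modulation for illustration: unit 0 is plastic, unit 1 is frozen.
modulation = [1.0, 0.0]
new_weights = neuromodulated_update(weights, stimulus, activity, modulation)
```

With this modulation vector, the incoming weights of unit 1 are left unchanged, while those of unit 0 move in the direction of the product of its activity and the stimulus. This illustrates how a modulation signal can gate where plasticity occurs without any reward term entering the update.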