One biologically plausible way to train an Artificial Neural Network (ANN) is to treat each unit as a stochastic Reinforcement Learning (RL) agent, so that the network as a whole becomes a team of agents. All units can then learn via REINFORCE, a local learning rule modulated by a global reward signal, which aligns more closely with biologically observed forms of synaptic plasticity. However, this learning method tends to be slow and does not scale well with the size of the network. The inefficiency arises from two factors that impede effective structural credit assignment: (i) all units explore independently, and (ii) a single reward is used to evaluate the actions of all units. Accordingly, methods for improving structural credit assignment generally fall into two categories. The first comprises algorithms that enable coordinated exploration among units, such as MAP propagation. The second comprises algorithms that compute a more specific reward signal for each unit in the network, such as Weight Maximization and its variants. This research report focuses on the first category. We propose using Boltzmann machines or a recurrent network for coordinated exploration. We show that the negative phase, which is typically necessary to train Boltzmann machines, can be removed. The resulting learning rules are similar to the reward-modulated Hebbian learning rule. Experimental results show that, for networks of multiple stochastic, discrete units trained with REINFORCE, coordinated exploration trains significantly faster than independent exploration and even surpasses straight-through estimator (STE) backpropagation.
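To make the baseline concrete, the following is a minimal sketch of the independent-exploration setting described above, not the coordinated-exploration method proposed in the report. It assumes Bernoulli-logistic units, a single trainable bias per unit, a running-average reward baseline, and a toy task in which the global reward is the fraction of units that fire; the function name `train_team` and all hyperparameters are hypothetical choices for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_team(n_units=4, n_steps=3000, lr=0.5, seed=0):
    """Train a team of independent Bernoulli-logistic units with REINFORCE.

    Toy task (an assumption for illustration): every unit should fire.
    All units receive the same scalar reward, the fraction of units that fired.
    Returns the final firing probability of each unit.
    """
    rng = random.Random(seed)
    bias = [0.0] * n_units   # one trainable parameter per unit
    baseline = 0.0           # running average of the reward (variance reduction)
    for _ in range(n_steps):
        probs = [sigmoid(b) for b in bias]
        # Each unit samples its action independently of the others.
        actions = [1 if rng.random() < p else 0 for p in probs]
        # A single global reward evaluates the actions of all units.
        reward = sum(actions) / n_units
        for i in range(n_units):
            # Local rule: (action - firing probability) is the gradient of the
            # unit's log-probability w.r.t. its bias, scaled by the shared,
            # baseline-corrected reward.
            bias[i] += lr * (reward - baseline) * (actions[i] - probs[i])
        baseline += 0.05 * (reward - baseline)
    return [sigmoid(b) for b in bias]


if __name__ == "__main__":
    print(train_team())
```

Each unit's update uses only its own action and firing probability plus the shared reward, which is what makes the rule local. Because the reward mixes in the random actions of every other unit, the per-unit learning signal gets noisier as `n_units` grows, which illustrates the scaling problem the abstract attributes to factors (i) and (ii).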