Recent advances in reinforcement learning (RL) have led to significant improvements in task performance. However, training neural networks in an RL regime typically relies on backpropagation, limiting applicability in resource-constrained environments or with non-differentiable networks. While noise-based alternatives such as reward-modulated Hebbian learning (RMHL) have been proposed, their performance has remained limited, especially in scenarios with delayed rewards, which require retrospective credit assignment over time. Here, we derive a novel noise-based learning rule that addresses these challenges. Our approach combines directional derivative theory with Hebbian-like updates to enable efficient, gradient-free learning in RL. It features stochastic noisy neurons that can approximate gradients, and produces local synaptic updates modulated by a global reward signal. Drawing on concepts from neuroscience, our method uses reward prediction error as its optimization target to generate increasingly advantageous behavior, and incorporates an eligibility trace to facilitate temporal credit assignment in environments with delayed rewards. Because its formulation relies on local information alone, it is compatible with implementations in neuromorphic hardware. Experimental validation shows that our approach significantly outperforms RMHL and is competitive with BP-based baselines, highlighting the promise of noise-based, biologically inspired learning for low-power and real-time applications.
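To make the mechanism concrete, the sketch below shows a node-perturbation-style update of the kind the abstract describes, in Python/NumPy. It is not the paper's implementation: the single linear layer, Gaussian perturbation scale, running-average reward baseline (standing in for the reward prediction error), and decay constants (sigma, lr, trace_decay, baseline_lr) are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch of a noise-based, reward-modulated rule with an eligibility
# trace. Assumed details (not from the paper): one linear layer, Gaussian
# per-neuron noise, and a running-average reward baseline as the reward
# prediction error.

rng = np.random.default_rng(0)

n_in, n_out = 8, 2
W = rng.normal(scale=0.1, size=(n_out, n_in))  # synaptic weights
trace = np.zeros_like(W)                       # eligibility trace
r_bar = 0.0                                    # running reward baseline

sigma = 0.1         # perturbation noise scale
lr = 0.01           # learning rate
trace_decay = 0.9   # eligibility trace decay
baseline_lr = 0.05  # baseline update rate

def step(x, reward):
    """One update: perturb activity, accumulate a Hebbian-like eligibility
    trace, and apply it scaled by the reward prediction error."""
    global W, r_bar
    xi = rng.normal(scale=sigma, size=(n_out,))  # per-neuron noise
    y = W @ x + xi                               # noisy activation

    # Local Hebbian-like term: the neuron's noise (its deviation from the
    # noiseless response) times the presynaptic input. Accumulating it in a
    # decaying trace lets earlier activity be credited when reward is delayed.
    trace[:] = trace_decay * trace + np.outer(xi, x)

    # Global modulation by reward prediction error (reward minus baseline).
    rpe = reward - r_bar
    W += lr * rpe * trace
    r_bar += baseline_lr * rpe
    return y
```

The locality claimed in the abstract is visible in the update: each weight change depends only on its presynaptic input, its postsynaptic neuron's own noise, and a single global scalar, the reward prediction error.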