Reinforcement Learning (RL) provides a powerful framework for decision-making in complex environments, but implementing it in hardware-efficient, bio-inspired ways remains a challenge. This paper presents a novel Spiking Neural Network (SNN) architecture for solving RL problems with real-valued observations. Building upon prior work, the proposed model incorporates multi-layered event-based clustering and adds Temporal Difference (TD)-error modulation and eligibility traces. An ablation study confirms that these components have a significant impact on the model's performance. A tabular actor-critic algorithm with eligibility traces and a state-of-the-art Proximal Policy Optimization (PPO) algorithm serve as benchmarks. Our network consistently outperforms the tabular approach and successfully discovers stable control policies on three classic RL environments: mountain car, cart-pole, and acrobot. The proposed model offers an appealing trade-off in computational and hardware implementation requirements: it requires neither an external memory buffer nor global error-gradient computation, and synaptic updates occur online, driven by local learning rules and a broadcast TD-error signal. Thus, this work contributes to the development of more hardware-efficient RL solutions.
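The interplay of eligibility traces and a broadcast TD error described above can be illustrated with the tabular actor-critic baseline rather than the SNN itself. The following is a minimal sketch, assuming accumulating traces, a softmax policy over tabular preferences, and no per-step discount correction on the actor. The function names, hyperparameter values, and array layout are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of action preferences."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def actor_critic_step(v, theta, z_v, z_theta, s, a, r, s_next, done,
                      gamma=0.99, lam=0.9, alpha_v=0.1, alpha_theta=0.05):
    """One online update of a tabular actor-critic with eligibility traces.

    v       : state values, shape (n_states,)
    theta   : action preferences, shape (n_states, n_actions)
    z_v     : critic eligibility traces, same shape as v
    z_theta : actor eligibility traces, same shape as theta
    All four arrays are updated in place; the TD error is returned.
    (Illustrative assumption: names and defaults are not from the paper.)
    """
    # TD error: one scalar that is shared by ("broadcast to") every parameter.
    target = r if done else r + gamma * v[s_next]
    delta = target - v[s]

    # Critic trace: decay everything, then mark the state just visited.
    z_v *= gamma * lam
    z_v[s] += 1.0

    # Actor trace: decay, then add the gradient of log pi(a | s)
    # with respect to the preferences of state s.
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0
    z_theta *= gamma * lam
    z_theta[s] += grad_log_pi

    # Each entry changes by (global TD error) x (its own local trace),
    # so no replay buffer or backpropagated gradient is involved.
    v += alpha_v * delta * z_v
    theta += alpha_theta * delta * z_theta
    return delta
```

In this sketch the only non-local quantity is the scalar `delta`; everything else in the update is stored alongside the parameter it modifies, which is the property the abstract highlights for online, buffer-free learning. Traces would need to be reset to zero at the start of each episode.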