With an ever-growing number of parameters defining increasingly complex networks, Deep Learning has led to several breakthroughs surpassing human performance. As a result, moving these millions of model parameters between memory and compute units causes a growing imbalance known as the memory wall. Neuromorphic computing is an emerging paradigm that confronts this imbalance by performing computations directly in analog memories. On the software side, the sequential Backpropagation algorithm prevents efficient parallelization and thus fast convergence. A novel method, Direct Feedback Alignment, resolves the inherent layer dependencies by passing the error directly from the output to each layer. At the intersection of hardware/software co-design, there is a demand for algorithms that are tolerant of hardware nonidealities. Therefore, this work explores the interdependencies involved in implementing bio-plausible learning in situ on neuromorphic hardware, emphasizing energy, area, and latency constraints. Using the benchmarking framework DNN+NeuroSim, we investigate the impact of hardware nonidealities and quantization on algorithm performance, as well as how network topologies and algorithm-level design choices scale the latency, energy, and area consumption of a chip. To the best of our knowledge, this work is the first to compare the impact of different learning algorithms on Compute-In-Memory-based hardware and vice versa. The best accuracy is still achieved with Backpropagation, notably in the presence of hardware imperfections. Direct Feedback Alignment, on the other hand, allows for significant speedup due to parallelization, reducing training time by a factor approaching N for N-layered networks.
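To illustrate how Direct Feedback Alignment removes the layer-to-layer dependency of Backpropagation, the following is a minimal NumPy sketch, not the setup evaluated in this work. It assumes a small tanh multilayer perceptron with a linear output layer and a squared-error loss. The names `init_network`, `dfa_step`, and `train_demo` are hypothetical, and no hardware nonidealities or quantization are modeled. Each hidden layer receives the output error through its own fixed random feedback matrix instead of through the transposed weights of the layers above it.

```python
import numpy as np


def init_network(sizes, rng):
    """Create forward weights and fixed random feedback matrices (hypothetical helper)."""
    weights = [rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
               for m, n in zip(sizes[:-1], sizes[1:])]
    # One feedback matrix per hidden layer, mapping the output error to that layer.
    feedback = [rng.normal(0.0, 1.0 / np.sqrt(sizes[-1]), size=(sizes[-1], n))
                for n in sizes[1:-1]]
    return weights, feedback


def dfa_step(weights, feedback, x, y, lr=0.05):
    """One DFA update; returns new weights and the pre-update mean squared error."""
    # Forward pass: store the input to every layer.
    activations = [x]
    for W in weights[:-1]:
        activations.append(np.tanh(activations[-1] @ W))
    output = activations[-1] @ weights[-1]  # linear output layer (assumption)
    error = output - y                      # gradient of 0.5 * squared error

    batch = x.shape[0]
    grads = [None] * len(weights)
    # The output layer sees the true error, as in Backpropagation.
    grads[-1] = activations[-1].T @ error / batch
    # Hidden layers: each delta depends only on the output error and on
    # local activations, so these iterations are independent of each other.
    for i, B in enumerate(feedback):
        h = activations[i + 1]
        delta = (error @ B) * (1.0 - h ** 2)  # tanh derivative
        grads[i] = activations[i].T @ delta / batch

    new_weights = [W - lr * g for W, g in zip(weights, grads)]
    return new_weights, float(np.mean(error ** 2))


def train_demo(steps=200, seed=0):
    """Train on a toy regression task (assumed data) and return the loss history."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(64, 4))
    y = np.tanh(x @ rng.normal(size=(4, 2)))
    weights, feedback = init_network([4, 16, 16, 2], rng)
    losses = []
    for _ in range(steps):
        weights, loss = dfa_step(weights, feedback, x, y)
        losses.append(loss)
    return losses
```

Because no hidden-layer delta waits on the layer above it, the loop over `feedback` could in principle run concurrently once the output error is available. This independence is the source of the parallel speedup the abstract refers to, whereas Backpropagation must compute the same deltas one layer after another.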