The need for deep neural network (DNN) models with higher performance and better functionality has led to a proliferation of very large models. Model training, however, requires intensive computation time and energy. Memristor-based compute-in-memory (CIM) modules can perform vector-matrix multiplication (VMM) in situ and in parallel, and have shown great promise in DNN inference applications. However, CIM-based model training faces challenges due to non-linear weight updates, device variations, and low precision in analog computing circuits. In this work, we experimentally implement a mixed-precision training scheme to mitigate these effects using a bulk-switching memristor CIM module. Low-precision CIM modules are used to accelerate the expensive VMM operations, with high-precision weight updates accumulated in digital units. Memristor devices are only changed when the accumulated weight update value exceeds a pre-defined threshold. The proposed scheme is implemented with a system-on-chip (SoC) of fully integrated analog CIM modules and digital sub-systems, showing fast convergence of LeNet training to 97.73% accuracy. The efficacy of training larger models is evaluated using realistic hardware parameters, and the results show that analog CIM modules can enable efficient mixed-precision DNN training with accuracy comparable to full-precision software-trained models. Additionally, models trained on chip are inherently robust to hardware variations, allowing direct mapping to CIM inference chips without additional re-training.
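The threshold-gated update described in the abstract can be illustrated with a minimal software sketch. It is not the on-chip implementation. The function name `accumulate_and_program`, the uniform conductance step `step`, the clipping range `w_max`, and the choice of setting the threshold equal to one device step are all assumptions made for illustration; the abstract only states that a pre-defined threshold is used.

```python
def accumulate_and_program(device_w, accum, grads, lr, step, w_max=1.0):
    """One mixed-precision update over a flat list of weights (illustrative).

    device_w : low-precision weights held on the (simulated) memristors,
               assumed to be multiples of `step`
    accum    : high-precision digital accumulators, one per weight
    grads    : gradients for this training step
    lr       : learning rate
    step     : assumed uniform weight change per device programming step;
               also used here as the programming threshold (an assumption)
    Returns the number of devices that were reprogrammed.
    """
    n_programmed = 0
    for i, g in enumerate(grads):
        # Accumulate the high-precision update in the digital domain.
        accum[i] -= lr * g
        # Whole device steps contained in the accumulator (truncates toward zero).
        n_steps = int(accum[i] / step)
        if n_steps != 0:
            # Accumulated update reached the threshold: program the device,
            # clipping to the assumed representable range [-w_max, w_max].
            new_w = device_w[i] + n_steps * step
            device_w[i] = max(-w_max, min(w_max, new_w))
            # Keep the sub-threshold residual for later steps.
            accum[i] -= n_steps * step
            n_programmed += 1
    return n_programmed
```

In this sketch a small gradient leaves the device untouched and only grows the digital accumulator, while a large gradient triggers a programming event and leaves a residual behind. Fewer programming events means the non-linear, variable device update is invoked less often, which is the motivation the abstract gives for the scheme.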