Aiming to accelerate the training of large deep neural networks (DNNs) in an energy-efficient way, analog in-memory computing (AIMC) has emerged as a solution with immense potential. AIMC accelerators keep model weights in memory, avoiding costly data movement between memory and processors during training and thereby dramatically reducing overhead. Despite this efficiency, scaling up AIMC systems presents significant challenges. Because weight copying is expensive and inaccurate on analog devices, data parallelism is less efficient on AIMC accelerators. This motivates pipeline parallelism, particularly asynchronous pipeline parallelism, which keeps all available accelerators busy throughout training. This paper examines the convergence theory of stochastic gradient descent on AIMC hardware with an asynchronous pipeline (Analog-SGD-AP). Although AIMC accelerators have been explored empirically, the theoretical understanding of how analog hardware imperfections in weight updates affect the training of multi-layer DNN models remains limited. Furthermore, asynchronous pipeline parallelism introduces stale weights, so the update signals are no longer valid gradients. To close this gap, this paper investigates the convergence properties of Analog-SGD-AP for multi-layer DNN training. We show that Analog-SGD-AP converges with iteration complexity $O(\varepsilon^{-2}+\varepsilon^{-1})$ despite these issues, matching the complexities of digital SGD and of Analog SGD with a synchronous pipeline except for the non-dominant term $O(\varepsilon^{-1})$. This implies that, by overlapping computation, AIMC training benefits from asynchronous pipelining almost for free compared with the synchronous pipeline.
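To make the stale-weight issue concrete, a schematic update rule can be written as follows; the step size $\alpha$, staleness $\tau$, sample $\xi^{t}$, and analog-imperfection term $E^{t}$ are assumed notation for illustration only and need not match the paper's exact Analog-SGD-AP recursion:
$$
W^{t+1} \;=\; W^{t} \;-\; \alpha\, \nabla f\bigl(W^{t-\tau};\, \xi^{t}\bigr) \;+\; E^{t},
$$
where $\tau \ge 0$ is the delay induced by the asynchronous pipeline, so the gradient is evaluated at an outdated iterate rather than the current one, and $E^{t}$ collects the error introduced by imperfect analog weight updates. With $\tau = 0$ and $E^{t} = 0$, the recursion reduces to standard digital SGD.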