Analog In-Memory Computing (AIMC) offers a promising solution to the von Neumann bottleneck. However, deploying transformer models on AIMC remains challenging because these models must stay flexible and adaptable across diverse tasks. For the benefits of AIMC to be fully realized, the weights of static vector-matrix multiplications must be mapped and programmed onto analog devices in a weight-stationary manner. This poses two challenges for adapting a base network to hardware and downstream tasks: (i) conventional analog hardware-aware (AHWA) training requires retraining the entire model, and (ii) reprogramming analog devices is both time- and energy-intensive. To address these issues, we propose Analog Hardware-Aware Low-Rank Adaptation (AHWA-LoRA) training, a novel approach for efficiently adapting transformers to AIMC hardware. AHWA-LoRA training keeps the analog weights fixed as meta-weights and introduces lightweight external LoRA modules for both hardware and task adaptation. We validate AHWA-LoRA training on SQuAD v1.1 and the GLUE benchmark, demonstrate its scalability to larger models, and show its effectiveness in instruction tuning and reinforcement learning. We further evaluate a practical deployment scenario in which optimized pipeline strategies balance AIMC tile latency against digital LoRA processing on RISC-V-based programmable multi-core accelerators. This hybrid architecture achieves efficient transformer inference with only a 4% per-layer overhead compared to a fully AIMC implementation.
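As a minimal sketch of the idea, assuming a simple additive Gaussian weight-noise model for the analog tile (the noise model, the function names `analog_matvec` and `ahwa_lora_forward`, and the `alpha` and `noise_std` parameters are illustrative assumptions, not details given above), a single layer can be written as a fixed analog vector-matrix product plus a digital low-rank correction:

```python
import numpy as np

def analog_matvec(w_programmed, x, rng, noise_std=0.02):
    """Hypothetical AIMC tile: fixed programmed weights read with additive
    Gaussian noise scaled to the largest weight magnitude (assumed model)."""
    scale = noise_std * np.abs(w_programmed).max()
    noise = rng.normal(0.0, scale, size=w_programmed.shape)
    return (w_programmed + noise) @ x

def ahwa_lora_forward(w_programmed, lora_a, lora_b, x, rng,
                      alpha=16.0, noise_std=0.02):
    """Analog path with frozen meta-weights plus a digital LoRA path.

    Only lora_a (rank x d_in) and lora_b (d_out x rank) would be trained;
    w_programmed is never updated, so the analog devices are not reprogrammed.
    """
    rank = lora_a.shape[0]
    analog_out = analog_matvec(w_programmed, x, rng, noise_std)
    digital_out = (alpha / rank) * (lora_b @ (lora_a @ x))
    return analog_out + digital_out

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 32, 4
w = rng.normal(size=(d_out, d_in))             # frozen analog meta-weights
a = rng.normal(scale=0.01, size=(rank, d_in))  # LoRA down-projection
b = np.zeros((d_out, rank))                    # common LoRA init: B = 0
x = rng.normal(size=d_in)

# With the noise switched off and B = 0, the layer reduces to the base model.
y = ahwa_lora_forward(w, a, b, x, rng, noise_std=0.0)
```

In this sketch the LoRA path holds rank × (d_in + d_out) trainable values against d_in × d_out fixed analog weights, which illustrates why adapting only the external modules avoids both full retraining and device reprogramming.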