Low-rank adaptation (LoRA) is a predominant parameter-efficient finetuning method for adapting large language models (LLMs) to downstream tasks. Meanwhile, Compute-in-Memory (CIM) architectures offer superior energy efficiency thanks to their array-level parallel in-memory computing designs. In this paper, we propose deploying LoRA-finetuned LLMs on a hybrid CIM architecture (i.e., pretrained weights on energy-efficient Resistive Random-Access Memory (RRAM) and LoRA branches on noise-free Static Random-Access Memory (SRAM)), reducing the energy cost to about 3\% of that of an Nvidia A100 GPU. However, the inherent noise that RRAM introduces into the stored weights simultaneously degrades model performance. To address this issue, we design a novel Hardware-aware Low-rank Adaptation (HaLoRA) method. The key insight is to train a LoRA branch that is robust to such noise and deploy it on noise-free SRAM; the extra cost is negligible because LoRA parameters are far fewer than the pretrained weights (e.g., 0.15\% for the LLaMA-3.2 1B model). To improve robustness to the noise, we theoretically analyze the gap between the optimization trajectories of the LoRA branch under ideal and noisy conditions, and further design an extra loss that minimizes an upper bound on this gap. As a result, we enjoy both energy efficiency and accuracy during inference. Experiments finetuning the Qwen and LLaMA series demonstrate the effectiveness of HaLoRA across multiple reasoning tasks, achieving up to a \textbf{22.7}-point improvement in average score while maintaining robustness across various noise types and noise levels.
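To make the deployment model concrete, below is a minimal PyTorch sketch (our illustration, not the paper's released code) of a linear layer whose frozen pretrained weight is perturbed by simulated RRAM read noise while the trainable LoRA branch stays noise-free, together with a hypothetical auxiliary loss that pulls the noisy output toward the ideal one. The multiplicative Gaussian noise model, the rank, and the weight `lam` are illustrative assumptions; the paper's actual bound is on the optimization-trajectory gap, which this output-alignment term only approximates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridLoRALinear(nn.Module):
    """Frozen base weight with simulated RRAM noise plus a noise-free LoRA branch."""

    def __init__(self, in_features, out_features, rank=8, noise_std=0.02):
        super().__init__()
        # Pretrained weight: frozen, and at inference it would reside on RRAM,
        # so we simulate its read noise during training.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)
        # Trainable LoRA branch, deployed on noise-free SRAM (standard init: B = 0).
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.noise_std = noise_std

    def forward(self, x, simulate_noise=True):
        w = self.weight
        if simulate_noise:
            # Multiplicative Gaussian read noise, a common stand-in for RRAM
            # conductance variation (an assumption; the paper may use other models).
            w = w * (1.0 + self.noise_std * torch.randn_like(w))
        return x @ w.T + x @ self.lora_A.T @ self.lora_B.T


def training_loss(layer, x, target, lam=0.1):
    # Task loss under noisy weights, plus a hypothetical alignment term that
    # pulls the noisy output toward the ideal (noise-free) one, standing in
    # for the paper's upper bound on the trajectory gap.
    noisy = layer(x, simulate_noise=True)
    ideal = layer(x, simulate_noise=False)
    return F.mse_loss(noisy, target) + lam * F.mse_loss(noisy, ideal.detach())
```

In a real finetuning run, the `F.mse_loss` on targets would be replaced by the language-modeling loss, and the noise is resampled at every forward pass, as the stochastic `torch.randn_like` call above already does.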