Low-rank adaptation (LoRA) is a predominant parameter-efficient finetuning method for adapting large language models (LLMs) to downstream tasks. In this paper, we first propose to deploy LoRA-finetuned LLMs on a hybrid compute-in-memory (CIM) architecture (i.e., mapping the pretrained weights onto RRAM and the LoRA weights onto SRAM). To address the performance degradation caused by RRAM's inherent noise, we design a novel Hardware-aware Low-rank Adaptation (HaLoRA) method, which trains a LoRA branch that is both robust and accurate by aligning the training objectives under ideal and noisy conditions. Experiments finetuning LLaMA 3.2 1B and 3B demonstrate HaLoRA's effectiveness across multiple reasoning tasks, achieving up to a 22.7% improvement in average score while maintaining robustness at various noise levels.
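To make the deployment setting concrete, the following minimal NumPy sketch models the hybrid mapping described above: the pretrained weight sees multiplicative Gaussian noise (a common model of RRAM non-ideality), while the LoRA branch stays noise-free on SRAM. The alignment term shown is one hypothetical way to penalize divergence between the ideal and noisy forward passes; all dimensions, the noise model, and the loss form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
d_in, d_out, r = 64, 64, 8

W = rng.standard_normal((d_in, d_out))      # pretrained weight (mapped to RRAM)
A = rng.standard_normal((d_in, r)) * 0.01   # LoRA down-projection (SRAM)
B = np.zeros((r, d_out))                    # LoRA up-projection, zero-initialized (SRAM)

def forward(x, noise_std=0.0):
    """LoRA forward pass; multiplicative Gaussian noise on W models RRAM non-ideality."""
    noise = rng.standard_normal(W.shape) * noise_std * np.abs(W)
    return x @ (W + noise) + x @ A @ B      # LoRA path on SRAM stays noise-free

x = rng.standard_normal((4, d_in))
y_ideal = forward(x, noise_std=0.0)
y_noisy = forward(x, noise_std=0.1)

# Alignment term: penalize the gap between ideal and noisy outputs,
# one assumed realization of aligning the two training objectives.
align_loss = np.mean((y_ideal - y_noisy) ** 2)
```

In a real training loop this term would be combined with the task loss so that the learned A and B compensate for noise on W rather than only fitting the ideal weights.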