Low-rank adaptation (LoRA) is a predominant parameter-efficient finetuning method for adapting large language models (LLMs) to downstream tasks. In this paper, we first propose to deploy LoRA-finetuned LLMs on a hybrid compute-in-memory (CIM) architecture (i.e., pretrained weights on RRAM and LoRA branches on SRAM). To address the performance degradation caused by RRAM's inherent noise, we design a novel Hardware-aware Low-rank Adaptation (HaLoRA) method, which trains a LoRA branch that is both robust and accurate by aligning the training objectives under ideal and noisy conditions. Experiments finetuning LLaMA 3.2 1B and 3B demonstrate HaLoRA's effectiveness across multiple reasoning tasks, achieving up to a 22.7 improvement in average score while maintaining robustness at various noise levels.
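To make the deployment setting concrete, the following is a minimal sketch of the hybrid forward pass and the objective-alignment idea described above. It assumes a common multiplicative-Gaussian model for RRAM read noise and treats the SRAM-resident LoRA factors as noise-free; the function names (`lora_forward`, `alignment_loss`) and the specific alignment penalty are illustrative, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def lora_forward(x, W, A, B, noise_std=0.0):
    # Pretrained weight W resides on RRAM: simulate its read noise as
    # multiplicative Gaussian (a common RRAM noise model; an assumption here).
    W_noisy = W * (1.0 + noise_std * rng.standard_normal(W.shape))
    # LoRA factors A, B reside on SRAM and are read back noise-free.
    return x @ W_noisy + (x @ A) @ B

def alignment_loss(x, W, A, B, noise_std):
    # Illustrative version of "aligning the training objectives under
    # ideal and noisy conditions": penalize the gap between the ideal
    # output and one noisy-hardware sample of the same forward pass.
    y_ideal = lora_forward(x, W, A, B, noise_std=0.0)
    y_noisy = lora_forward(x, W, A, B, noise_std=noise_std)
    return float(np.mean((y_ideal - y_noisy) ** 2))
```

In practice this penalty would be added to the task loss during LoRA training, so that the learned branch compensates for the RRAM noise it will face at deployment time.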