Parameter-efficient fine-tuning (PEFT) methods have emerged to mitigate the prohibitive cost of fully fine-tuning large language models (LLMs). Nonetheless, the fine-tuned models remain enormous, which impedes routine deployment. To address this issue, we present Parameter-Efficient and Quantization-aware Adaptation (PEQA), a novel quantization-aware PEFT technique that facilitates model compression and accelerates inference. PEQA operates in two stages: first, the parameter matrix of each fully-connected layer is quantized into a matrix of low-bit integers and a vector of scalar quantization scales; then, only this scale vector is fine-tuned for each downstream task. This strategy compresses the model considerably, which lowers inference latency upon deployment and reduces the overall memory required. At the same time, fast fine-tuning and efficient task switching become possible. In this way, PEQA offers the benefits of quantization while inheriting the advantages of PEFT. We compare PEQA with competitive baselines in comprehensive experiments ranging from natural language understanding to generation benchmarks, using LLMs of up to 65 billion parameters. The results demonstrate PEQA's scalability, task-specific adaptation performance, and ability to follow instructions, even in extremely low-bit settings.
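The two-stage process described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not specify the quantization scheme, so the sketch assumes per-row asymmetric uniform quantization with a zero-point, a squared-error loss, and plain gradient descent. All names (`quantize_rows`, `forward`, `finetune_scales`, `squared_error`) and the toy values are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
IntMatrix = List[List[int]]


def quantize_rows(W: Matrix, bits: int) -> Tuple[IntMatrix, List[float], List[int]]:
    """Stage 1 (assumed scheme): W[i][j] ~= s[i] * (q[i][j] - z[i]), one scale per row."""
    levels = 2 ** bits - 1
    q, scales, zeros = [], [], []
    for row in W:
        lo, hi = min(row), max(row)
        s = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant row
        z = round(-lo / s)             # integer zero-point
        q.append([min(levels, max(0, round(w / s) + z)) for w in row])
        scales.append(s)
        zeros.append(z)
    return q, scales, zeros


def forward(q: IntMatrix, s: List[float], z: List[int], x: List[float]) -> List[float]:
    """Fully-connected layer output using the dequantized weights."""
    return [
        s_i * sum((q_ij - z_i) * x_j for q_ij, x_j in zip(row, x))
        for row, s_i, z_i in zip(q, s, z)
    ]


def squared_error(y: List[float], t: List[float]) -> float:
    return 0.5 * sum((a - b) ** 2 for a, b in zip(y, t))


def finetune_scales(q, s, z, x, t, lr=1e-3, steps=50) -> List[float]:
    """Stage 2: gradient descent on the scale vector only; q and z stay frozen."""
    s = list(s)  # copy so the original scales are left untouched
    for _ in range(steps):
        for i, row in enumerate(q):
            a = sum((q_ij - z[i]) * x_j for q_ij, x_j in zip(row, x))
            s[i] -= lr * (s[i] * a - t[i]) * a  # dL/ds_i = (y_i - t_i) * a
    return s


# Toy example: a 2x3 layer quantized to 4 bits, then adapted to one (x, t) pair.
W = [[0.5, -1.0, 0.25], [2.0, 0.0, -2.0]]
x, t = [1.0, 2.0, -1.0], [1.0, -1.0]
q, s0, z = quantize_rows(W, bits=4)
s_task = finetune_scales(q, s0, z, x, t)
```

Under these assumptions, the integer matrix `q` and zero-points `z` are shared across tasks, and only the small vector `s_task` is task-specific. Switching tasks then amounts to swapping one scale vector per layer, which is the property the abstract refers to as efficient task switching.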