The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and parameter-efficient fine-tuning (PEFT), which lowers training overhead. This motivated the development of quantization-aware PEFT to produce accurate yet efficient quantized models. In this setting, reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but their direct integration into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme incorporating adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at https://github.com/vantaa89/qwha.
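The abstract names the Walsh-Hadamard Transform (WHT) as QWHA's transform kernel; its appeal is that it needs only additions and subtractions and runs in O(n log n). Below is a minimal sketch of the fast WHT in its unnormalized form, assuming a power-of-two input length. The function name `fwht` is illustrative and not taken from the paper's codebase.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform (butterfly form).

    Requires len(x) to be a power of two. Uses only additions and
    subtractions, running in O(n log n) with no multiplications,
    which is what makes the WHT a cheap transform kernel.
    """
    x = np.asarray(x, dtype=float).copy()
    n = x.shape[0]
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b  # butterfly step
        h *= 2
    return x
```

Applying the transform twice recovers the input scaled by n (the WHT is self-inverse up to normalization), e.g. `fwht(fwht(x)) == n * x`.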