Low-rank adaptation of large models, particularly LoRA, has gained traction due to its computational efficiency. This efficiency, contrasted with the prohibitive costs of full-model fine-tuning, means that practitioners often turn to LoRA, sometimes without a complete understanding of its ramifications. In this study, we focus on fairness and ask whether LoRA has an unexamined impact on utility, calibration, and resistance to membership inference across different subgroups (e.g., genders, races, religions) compared to a full-model fine-tuning baseline. We present extensive experiments across vision and language domains and across classification and generation tasks using ViT-Base, Swin-v2-Large, Llama-2 7B, and Mistral 7B. Intriguingly, our experiments suggest that while one can isolate cases where LoRA exacerbates model bias across subgroups, the pattern is inconsistent -- in many cases, LoRA achieves equivalent or even improved fairness compared to the base model or its full fine-tuning baseline. We also examine the complications of evaluating fine-tuning fairness relating to task design and model token bias, calling for more careful fairness evaluations in future work.