Finetuning pretrained Vision-Language Models (VLMs) has recently become the prevailing paradigm for achieving state-of-the-art performance in Visual Question Answering (VQA). However, as VLMs scale, finetuning all model parameters for a given task in low-resource settings becomes computationally expensive, storage-inefficient, and prone to overfitting. Current parameter-efficient tuning methods dramatically reduce the number of tunable parameters, but a significant performance gap relative to full finetuning remains. In this paper, we propose MixPHM, a redundancy-aware parameter-efficient tuning method that outperforms full finetuning in low-resource VQA. Specifically, MixPHM is a lightweight module composed of multiple PHM-experts arranged in a mixture-of-experts manner. To reduce parameter redundancy, MixPHM reparameterizes expert weights in a low-rank subspace and shares part of the weights inside and across experts. Moreover, based on a quantitative redundancy analysis of adapters, we propose Redundancy Regularization, which reduces task-irrelevant redundancy while promoting task-relevant correlation in MixPHM representations. Experiments on VQA v2, GQA, and OK-VQA demonstrate that MixPHM outperforms state-of-the-art parameter-efficient methods and is the only one that consistently surpasses full finetuning.
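To make the idea of low-rank PHM-experts with shared weights concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the standard PHM construction, in which a weight matrix is a sum of Kronecker products, and factorizes each Kronecker factor into two rank-r matrices. The function names (`phm_weight`, `make_expert`, `mix_phm_forward`), the choice to share the small matrices `S` across all experts, the ReLU bottleneck, and the averaging of expert outputs are all illustrative assumptions; the abstract does not specify which weights are shared or how experts are combined.

```python
import numpy as np

def phm_weight(S, T, U):
    """Build a (d_in, d_out) weight as sum_i kron(S[i], T[i] @ U[i].T).

    S: (n, n, n) small matrices (assumed shared across experts here).
    T: (n, d_in // n, r) and U: (n, d_out // n, r) low-rank factors.
    """
    return sum(np.kron(S[i], T[i] @ U[i].T) for i in range(S.shape[0]))

def make_expert(rng, n, d_in, d_out, rank):
    """Sample the expert-specific low-rank factors (hypothetical init)."""
    T = rng.normal(scale=0.1, size=(n, d_in // n, rank))
    U = rng.normal(scale=0.1, size=(n, d_out // n, rank))
    return T, U

def mix_phm_forward(x, S_down, S_up, experts_down, experts_up):
    """Adapter-style residual block over several PHM-experts.

    Expert outputs are simply averaged; this is a simplification
    chosen for the sketch, not a routing rule taken from the paper.
    """
    outs = []
    for (Td, Ud), (Tu, Uu) in zip(experts_down, experts_up):
        h = np.maximum(x @ phm_weight(S_down, Td, Ud), 0.0)  # down-project + ReLU
        outs.append(h @ phm_weight(S_up, Tu, Uu))            # up-project
    return x + np.mean(outs, axis=0)

rng = np.random.default_rng(0)
n, d, d_r, rank, num_experts = 4, 16, 8, 1, 3

# Shared small matrices (assumption: one set per projection, reused by all experts).
S_down = rng.normal(size=(n, n, n))
S_up = rng.normal(size=(n, n, n))
experts_down = [make_expert(rng, n, d, d_r, rank) for _ in range(num_experts)]
experts_up = [make_expert(rng, n, d_r, d, rank) for _ in range(num_experts)]

x = rng.normal(size=(2, d))
y = mix_phm_forward(x, S_down, S_up, experts_down, experts_up)

# Compare tunable parameter counts against dense down/up projections per expert.
phm_params = S_down.size + S_up.size + sum(
    T.size + U.size for T, U in experts_down + experts_up
)
dense_params = num_experts * 2 * d * d_r
print(y.shape, phm_params, dense_params)
```

In this toy configuration the shared matrices account for a large fraction of the PHM parameter count because the dimensions are tiny; with realistic hidden sizes the expert-specific low-rank factors and the dense baseline both grow with the hidden dimension, while the shared matrices stay fixed at n³ entries each.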