Leveraging pre-trained vision-language models has become a widely adopted way to improve performance on downstream visual question answering (VQA) tasks. In the specialized field of medical VQA, however, the scarcity of available data is a significant barrier to reliable model generalization. Many methods have been proposed to improve generalization from both data-centric and model-centric perspectives: data augmentation is commonly used to enrich the dataset, while regularization techniques aim to prevent overfitting, especially when training on limited data. In this paper, we introduce a method that applies gradient-guided parameter perturbations to the visual encoder of the multimodal model during both the pre-training and fine-tuning phases, in order to improve generalization on downstream medical VQA tasks. The small perturbation is generated adaptively by aligning it with the direction of the moving-average gradient in the optimization landscape, which is opposite to the direction of the optimizer's historical updates, and is then injected into the model's visual encoder. The results show that, even with a significantly smaller pre-training image-caption dataset, our approach achieves competitive results on both the VQA-RAD and SLAKE datasets.
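To make the perturbation step concrete, the following is a minimal pure-Python sketch on a toy quadratic loss, not the implementation used in the paper. The abstract does not specify the decay rate, the perturbation magnitude, or where the gradient is evaluated, so the names `beta`, `rho`, `update_ema`, `make_perturbation`, and `train`, as well as the choice to compute the gradient at the perturbed weights, are all assumptions made for illustration.

```python
import math

def update_ema(ema, grad, beta=0.9):
    # Exponential moving average of past gradients; beta is an assumed decay rate.
    return [beta * m + (1.0 - beta) * g for m, g in zip(ema, grad)]

def make_perturbation(ema, rho=0.01, eps=1e-12):
    # Rescale the moving-average gradient to length ~rho (assumed magnitude).
    # The direction is +gradient, i.e. opposite to the descent steps
    # the optimizer has been taking.
    norm = math.sqrt(sum(m * m for m in ema))
    return [rho * m / (norm + eps) for m in ema]

def toy_grad(w):
    # Gradient of the toy loss 0.5 * ||w||^2, standing in for the encoder loss.
    return list(w)

def train(w, steps=50, lr=0.1, beta=0.9, rho=0.01):
    ema = [0.0] * len(w)
    for _ in range(steps):
        delta = make_perturbation(ema, rho)
        # Inject the perturbation into the (toy) encoder weights.
        w_perturbed = [p + d for p, d in zip(w, delta)]
        # Assumption: the gradient is evaluated at the perturbed weights.
        g = toy_grad(w_perturbed)
        # Plain SGD step on the unperturbed weights.
        w = [p - lr * gi for p, gi in zip(w, g)]
        ema = update_ema(ema, g, beta)
    return w

final_w = train([1.0, -2.0])
```

In a real vision-language model, the same three steps (update the moving average, rescale it to a small norm, add it to the visual encoder's parameters) would presumably be applied per parameter tensor, while the text encoder and fusion layers stay unperturbed.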