With the rise of Visual and Language Pretraining (VLP), a growing number of downstream tasks adopt the paradigm of pretraining followed by fine-tuning. Although this paradigm has shown promise in various multimodal downstream tasks, applying it in the remote sensing domain faces obstacles. Specifically, embeddings of the same modality tend to cluster together, which impedes efficient transfer learning. To tackle this issue, we revisit the aim of multimodal transfer learning for downstream tasks from a unified perspective and rethink the optimization process in terms of three distinct objectives. We propose "Harmonized Transfer Learning and Modality Alignment (HarMA)", a method that simultaneously satisfies task constraints, modality alignment, and single-modality uniform alignment, while minimizing training overhead through parameter-efficient fine-tuning. Remarkably, without using external training data, HarMA achieves state-of-the-art performance on two popular multimodal retrieval tasks in remote sensing. Our experiments show that HarMA achieves performance competitive with, and even superior to, fully fine-tuned models while tuning only a small number of parameters. Owing to its simplicity, HarMA can be integrated into almost all existing multimodal pretraining models. We hope this method can facilitate the efficient application of large models to a wide range of downstream tasks while significantly reducing resource consumption. Code is available at https://github.com/seekerhuang/HarMA.