In this paper, we tackle two challenges in multimodal learning for visual recognition: 1) modalities may be missing during either training or testing in real-world situations; and 2) the computational resources needed to fine-tune heavy transformer models may not be available. To this end, we propose to use prompt learning to mitigate both challenges together. Specifically, our modality-missing-aware prompts can be plugged into multimodal transformers to handle general missing-modality cases, while requiring fewer than 1% of the learnable parameters needed to train the entire model. We further explore the effect of different prompt configurations and analyze robustness to missing modalities. Extensive experiments show that our prompt learning framework improves performance under various missing-modality cases while alleviating the need for heavy model re-training. Code is available.
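As a rough illustration of the idea, and not the released implementation, the following minimal sketch shows how a prompt chosen according to the missing-modality case could be prepended to the input tokens of a frozen backbone, and how the share of learnable parameters could be estimated. All names (`CASES`, `init_prompts`, `detect_case`, `build_input`, `trainable_ratio`), the two-modality image and text setting, the zero placeholder for a missing modality, and the backbone size in the usage lines are assumptions made for this example.

```python
import random

# Hypothetical missing-modality cases for an image + text model (assumption).
CASES = ("complete", "missing_text", "missing_image")


def init_prompts(num_tokens, dim, seed=0):
    """Create one small prompt (num_tokens x dim) per missing-modality case."""
    rng = random.Random(seed)
    return {
        case: [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_tokens)]
        for case in CASES
    }


def detect_case(image_tokens, text_tokens):
    """Map the set of available inputs to a case name."""
    if image_tokens is None and text_tokens is None:
        raise ValueError("at least one modality must be present")
    if text_tokens is None:
        return "missing_text"
    if image_tokens is None:
        return "missing_image"
    return "complete"


def build_input(prompts, image_tokens, text_tokens, dim):
    """Prepend the case-specific prompt to the available tokens.

    A missing modality is replaced by a single all-zero placeholder token,
    which is an assumption of this sketch.
    """
    case = detect_case(image_tokens, text_tokens)
    placeholder = [[0.0] * dim]
    text = text_tokens if text_tokens is not None else placeholder
    image = image_tokens if image_tokens is not None else placeholder
    return case, prompts[case] + text + image


def trainable_ratio(prompts, backbone_param_count):
    """Prompt parameters relative to the parameters of the frozen backbone."""
    prompt_params = sum(len(token) for p in prompts.values() for token in p)
    return prompt_params / backbone_param_count


# Usage: a text-only sample, so the "missing_image" prompt is selected.
dim = 8
prompts = init_prompts(num_tokens=4, dim=dim)
text = [[1.0] * dim for _ in range(3)]
case, sequence = build_input(prompts, image_tokens=None, text_tokens=text, dim=dim)

# The backbone size below is an arbitrary illustrative number, not a measured one.
ratio = trainable_ratio(prompts, backbone_param_count=100_000)
```

In this sketch only the entries of `prompts` would be updated during training, while the backbone that consumes `sequence` stays frozen, which is what keeps the learnable share small.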