In this paper, we tackle two challenges in multimodal learning for visual recognition: 1) modalities may be missing during training or testing in real-world situations; and 2) the computational resources needed to fine-tune heavy transformer models may not be available. To this end, we propose using prompt learning to mitigate both challenges together. Specifically, our modality-missing-aware prompts can be plugged into multimodal transformers to handle general missing-modality cases, while requiring less than 1% of the learnable parameters needed to train the entire model. We further explore the effect of different prompt configurations and analyze robustness to missing modalities. Extensive experiments show that our prompt learning framework improves performance under various missing-modality cases while alleviating the need for heavy model re-training. Code is available.
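The idea of modality-missing-aware prompts can be illustrated with a minimal sketch. It is not the released implementation: the case names, the prompt length, the embedding width, and the helper functions `make_prompts`, `missing_case`, `attach_prompt`, and `learnable_ratio` are all assumptions introduced here for illustration. The sketch keeps one learnable prompt per missing-modality case, prepends the matching prompt to a sample's token embeddings, and compares the number of prompt parameters with the size of a frozen backbone.

```python
import random

# Hypothetical case names; the abstract only states that the prompts
# are aware of which modality is missing.
CASES = ("complete", "missing_text", "missing_image")


def make_prompts(prompt_len, dim, seed=0):
    """Create one small prompt (prompt_len x dim) per missing-modality case."""
    rng = random.Random(seed)
    return {
        case: [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(prompt_len)]
        for case in CASES
    }


def missing_case(has_image, has_text):
    """Map the available modalities of a sample to a case name."""
    if has_image and has_text:
        return "complete"
    if has_image:
        return "missing_text"
    if has_text:
        return "missing_image"
    raise ValueError("at least one modality must be present")


def attach_prompt(prompts, tokens, has_image, has_text):
    """Prepend the prompt matching the sample's case to its token embeddings."""
    return prompts[missing_case(has_image, has_text)] + tokens


def learnable_ratio(prompts, backbone_params):
    """Prompt parameters relative to the (frozen) backbone parameter count."""
    n_prompt = sum(len(vec) for prompt in prompts.values() for vec in prompt)
    return n_prompt / backbone_params
```

For example, with an assumed prompt length of 16, an embedding width of 768, and a hypothetical backbone of 100 million parameters, `learnable_ratio(make_prompts(16, 768), 100_000_000)` is well below 0.01. The backbone weights are untouched in this setup; only the prompt vectors would be updated during training.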