Recently, multimodal prompting, which introduces a learnable missing-aware prompt for every missing-modality case, has exhibited impressive performance. However, it suffers from two critical issues: 1) the number of prompts grows exponentially with the number of modalities; and 2) it lacks robustness when the missing-modality settings differ between training and inference. In this paper, we propose a simple yet effective prompt design to address these challenges. Instead of using missing-aware prompts, we use prompts as modality-specific tokens, enabling them to capture the unique characteristics of each modality. Furthermore, our prompt design enforces orthogonality between prompts as a key element, so that the prompts learn distinct information across modalities and promote diversity in the learned representations. Extensive experiments demonstrate that our prompt design improves both performance and robustness while reducing the number of prompts.
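The two ideas in the abstract, prompt count and orthogonality between prompts, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names are hypothetical, the missing-aware count assumes one prompt per non-empty subset of available modalities, and the penalty (mean squared cosine similarity over prompt pairs) is only one plausible way to encourage orthogonality, not necessarily the loss the authors use.

```python
import math
from typing import Sequence


def missing_aware_prompt_count(num_modalities: int) -> int:
    # Assumption: one prompt per non-empty subset of available modalities,
    # so the count grows exponentially with the number of modalities.
    return 2 ** num_modalities - 1


def modality_specific_prompt_count(num_modalities: int) -> int:
    # One prompt per modality, so the count grows linearly.
    return num_modalities


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    # Cosine similarity with a small floor on the denominator
    # to avoid division by zero for all-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, 1e-12)


def orthogonality_penalty(prompts: Sequence[Sequence[float]]) -> float:
    # Hypothetical regularizer: mean squared cosine similarity over all
    # pairs of (flattened) modality prompts. It is zero when every pair
    # is orthogonal and one when all prompts point in the same direction.
    m = len(prompts)
    if m < 2:
        return 0.0
    total = 0.0
    pairs = 0
    for i in range(m):
        for j in range(i + 1, m):
            total += cosine(prompts[i], prompts[j]) ** 2
            pairs += 1
    return total / pairs
```

In a training loop, a term of this kind would typically be added to the task loss with a small weight, so that the modality prompts are pushed toward encoding different directions in the embedding space; the weight and the exact form of the term are design choices the abstract does not specify.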